The assignment consists of preparing and presenting a report on a Data Science project, applying the techniques learned during the course to the selected dataset.
The group chose to work in the R language (RStudio version 1.4.1717) and to use GitHub for version control. The project “/practica_ml” was created by Isabela Ignacio (user IsaPires1329) and shared with the other members of the group, Luisa Yánez (user lyanezgu) and Miguel García (user mgarciasanc2021).
Link to the project on GitHub: https://github.com/IsaPires1329/practica_ml.git.
library(formatR)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(tidyr)
library(missForest)
library(VIM)
library(formattable)
library(usmap)
library(cowplot)
library(corrplot)
library(MASS)
library(ggfortify)
library(nortest)
library(car)
library(lmtest)
library(PerformanceAnalytics)
library(Amelia)
library(ggthemes)
library(tidyverse)
library(tibble)
library(gridExtra)
library(ggbiplot)
library(factoextra)
library(caret)
library(ISLR)
library(rpart)
library(rpart.plot)
library(rattle)
library(tsne)
library(Rtsne)
library(class)
library(ada)
library(cluster)
library(useful)
library(mgcv)
library(xgboost)
library(randomForest)
library(kernlab)
library(pROC)
# doMC is distributed via R-Forge; install it once before knitting:
# install.packages("doMC", repos = "http://R-Forge.R-project.org")
library(doMC)
library(ggpubr)
The dataset chosen by the group is called “Red Wine Quality”. It contains information on red variants of the Portuguese “Vinho Verde” wine, analyzed and described through their physicochemical and sensory characteristics.
Link to the dataset: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.
The “Red Wine Quality” dataset contains 12 columns and 1599 rows, and is provided in .CSV format.
The data were initially loaded into a data frame called “red_wine”, and an initial inspection of its contents was carried out using the head and summary functions.
red_wine <- read_csv("winequality-red.csv")
head(red_wine)
## # A tibble: 6 x 12
## `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ... with 7 more variables: free sulfur dioxide <dbl>,
## # total sulfur dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Starting the initial analysis of the dataset, the 12 variables can be described as follows:
Input (predictor) variables, based on physicochemical tests:
Output (response/target) variable, based on sensory data:
The final goal of the project is to arrive at a model that predicts the quality of red “Vinho Verde” wine and tells us whether a wine is recommendable (pass/drinkable) or not recommendable and best avoided (fail/undrinkable).
We decided to rename the variables in the column headers so that they follow a single naming pattern and, at the same time, to avoid spaces, which could cause problems later on.
names(red_wine) <- c("fixed_acidity", "volatile_acidity", "citric_acid",
"residual_sugar", "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
"density", "pH", "sulphates", "alcohol", "quality")
head(red_wine)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ... with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
All the input variables in the dataset are numeric, and we understand that, in principle, they require no type transformation.
One option would be to convert the output variable “quality” into a factor, making it categorical according to wine quality (integer scores between 0 and 10). The wine could then be categorized as “bad”, “normal” and “good”, as “pass” or “fail”, or kept in the predefined 0-10 numeric categories.
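The categorizations discussed above could be sketched as follows; note that the cutpoints and labels below are our own illustrative choices, not values taken from the dataset documentation:

```r
# Sketch: possible categorizations of the quality score
# (cutpoints are assumptions, chosen for illustration).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)  # toy scores in the observed 3-8 range

# Three levels: "bad" (<= 4), "normal" (5-6), "good" (>= 7)
quality_cat3 <- cut(quality, breaks = c(-Inf, 4, 6, Inf),
                    labels = c("bad", "normal", "good"))

# Two levels: "fail" (< 6) vs. "pass" (>= 6)
quality_cat2 <- factor(ifelse(quality >= 6, "pass", "fail"))

table(quality_cat3)
table(quality_cat2)
```

Applied to the real data, `quality` would be replaced by `red_wine$quality`; cut() uses right-closed intervals by default, so a score of exactly 4 falls into "bad" and 6 into "normal".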
str(red_wine)
## spec_tbl_df [1,599 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Using the summary function we first verify that there are no missing values in the dataset. The group therefore introduced them manually, to approximate a more realistic scenario where missing values are the norm and have to be dealt with.
Missing values were introduced only in columns that we believe are not central to the analysis (pH and sulphates), so that the final prediction remains as accurate as possible.
We use the missForest library and set a seed so that the result is always reproducible.
red_wine
## # A tibble: 1,599 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # ... with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
set.seed(101)
red_wine <- bind_cols(red_wine[c(1, 2, 3, 4, 5, 6, 7, 8, 11,
12)], missForest::prodNA(red_wine[c(-1, -2, -3, -4, -5, -6,
-7, -8, -11, -12)], noNA = 0.1))
red_wine
## # A tibble: 1,599 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # ... with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
Using the VIM and Amelia libraries, we analyze the structure of the missing values in our dataset to see and understand how they are distributed and which variables they affect.
We can verify that the proportion of missing values in these variables is approximately 10%, and that there are 15 rows in which both variables are missing.
summary(aggr(red_wine, numbers = T, sortVar = T))
##
## Variables sorted by number of missings:
## Variable Count
## pH 0.10318949
## sulphates 0.09631019
## fixed_acidity 0.00000000
## volatile_acidity 0.00000000
## citric_acid 0.00000000
## residual_sugar 0.00000000
## chlorides 0.00000000
## free_sulfur_dioxide 0.00000000
## total_sulfur_dioxide 0.00000000
## density 0.00000000
## alcohol 0.00000000
## quality 0.00000000
##
## Missings per variable:
## Variable Count
## fixed_acidity 0
## volatile_acidity 0
## citric_acid 0
## residual_sugar 0
## chlorides 0
## free_sulfur_dioxide 0
## total_sulfur_dioxide 0
## density 0
## alcohol 0
## quality 0
## pH 165
## sulphates 154
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0:0:0:0:0 1295 80.9881176
## 0:0:0:0:0:0:0:0:0:0:0:1 139 8.6929331
## 0:0:0:0:0:0:0:0:0:0:1:0 150 9.3808630
## 0:0:0:0:0:0:0:0:0:0:1:1 15 0.9380863
missmap(red_wine, main = "Missing Map")
Having reviewed the general structure of the data and introduced the missing values we needed, we split the dataset in two, separating the observations used for training from those used for testing (given the amount of data available, the chosen split was 20% test and 80% training). We set a seed to fix the split permanently, so that the distribution of the data is always the same.
We also keep the test partition aside to be used later for validating the final model, and from here on we work with the training partition.
set.seed(101)
sample <- sample.int(n = nrow(red_wine), size = floor(0.8 * nrow(red_wine)),
replace = F)
train <- red_wine[sample, ]
test <- red_wine[-sample, ]
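Keeping the test partition aside for later validation can be made explicit by persisting both partitions to disk; a minimal sketch with a toy data frame (the file names and `tempdir()` location are our own choices):

```r
# Sketch: persist the train/test split so the exact same partition can be
# reloaded later for final model validation.
set.seed(101)
df <- data.frame(x = rnorm(10), y = rnorm(10))  # stand-in for red_wine
idx <- sample.int(n = nrow(df), size = floor(0.8 * nrow(df)), replace = FALSE)
saveRDS(df[idx, ],  file.path(tempdir(), "train.rds"))
saveRDS(df[-idx, ], file.path(tempdir(), "test.rds"))

# Reloading gives back exactly the same 8-row training partition
train_back <- readRDS(file.path(tempdir(), "train.rds"))
nrow(train_back)  # 8
```

saveRDS/readRDS preserve column types exactly, which avoids re-running the seeded split in a later session.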
train
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 2.8 0.068
## 2 7.6 0.49 0.33 1.9 0.074
## 3 5 1.02 0.04 1.4 0.045
## 4 7.6 0.43 0.29 2.1 0.075
## 5 6.8 0.59 0.1 1.7 0.063
## 6 6.8 0.815 0 1.2 0.267
## 7 8.5 0.21 0.52 1.9 0.09
## 8 7.4 0.36 0.29 2.6 0.087
## 9 5.5 0.49 0.03 1.8 0.044
## 10 6.8 0.49 0.22 2.3 0.071
## # ... with 1,269 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
test
## # A tibble: 320 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.3 0.65 0 1.2 0.065
## 3 8.9 0.22 0.48 1.8 0.077
## 4 7.6 0.41 0.24 1.8 0.08
## 5 7.1 0.71 0 1.9 0.08
## 6 5.7 1.13 0.09 1.5 0.172
## 7 7.3 0.45 0.36 5.9 0.074
## 8 8.1 0.66 0.22 2.2 0.069
## 9 6.8 0.67 0.02 1.8 0.05
## 10 5.6 0.31 0.37 1.4 0.074
## # ... with 310 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
To impute the missing values in the “pH” and “sulphates” columns, we decided to replace all their NAs with the median of each variable.
Using the summary function we verify that there are no missing values left in the train dataset.
train$pH[is.na(train$pH)] <- median(train$pH, na.rm = TRUE)
train$sulphates[is.na(train$sulphates)] <- median(train$sulphates,
na.rm = TRUE)
summary(train)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2600 Median : 2.200
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean : 2.552
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.: 2.600
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.0120 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.0710 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.0800 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.0882 Mean :15.86 Mean : 46.44 Mean :0.9968
## 3rd Qu.:0.0910 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.6110 Max. :68.00 Max. :289.00 Max. :1.0037
## alcohol quality pH sulphates
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :0.3300
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.:0.5600
## Median :10.20 Median :6.000 Median :3.300 Median :0.6200
## Mean :10.43 Mean :5.635 Mean :3.308 Mean :0.6526
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.:0.7100
## Max. :14.90 Max. :8.000 Max. :4.010 Max. :1.9500
We analyze how the different variables in our dataset are distributed.
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
From the plots we can see that some of the variables are approximately normally distributed, while others are skewed to the right.
The distributions of “fixed_acidity” and “volatile_acidity” are very similar, suggesting certain similarities between these two physicochemical indicators.
The “density” and “pH” variables are normally distributed, which indicates that all the red wines differ little in these two indicators. No transformation of their distribution is therefore required.
The variables “residual_sugar”, “chlorides”, “free_sulfur_dioxide”, “total_sulfur_dioxide” and “sulphates” are heavily skewed, so it is advisable to transform them to make the distribution of their values more homogeneous. This is achieved by applying a logarithmic transformation, which normalizes their distributions:
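The visual impression of normality could be complemented with a formal test; a sketch with simulated stand-in data using base R's Shapiro-Wilk test (with our sample of ~1279 rows even small departures reach significance, so the p-value should complement the histograms rather than replace them):

```r
# Sketch: formal normality check with simulated data standing in for the
# real columns (e.g. train$density, train$pH).
set.seed(1)
x_normal <- rnorm(500)  # stand-in for a roughly normal variable
x_skewed <- rexp(500)   # stand-in for a right-skewed variable

shapiro.test(x_normal)$p.value  # typically large: no evidence against normality
shapiro.test(x_skewed)$p.value < 0.05  # TRUE: clearly non-normal
```

The nortest library loaded earlier offers lillie.test() as an alternative suited to larger samples.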
train <- train %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
ga <- train %>%
ggplot(aes(x = Log_residual_sugar)) + geom_histogram(bins = 20,
fill = "#619CFF")
gb <- train %>%
ggplot(aes(x = Log_chlorides)) + geom_histogram(bins = 20,
fill = "#E58700")
gc <- train %>%
ggplot(aes(x = Log_free_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#00BF7D")
gd <- train %>%
ggplot(aes(x = Log_total_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#FD61D1")
ge <- train %>%
ggplot(aes(x = Log_sulphates)) + geom_histogram(bins = 20,
fill = "#B983FF")
grid.arrange(ga, gb, gc, gd, ge)
We modify our original dataset so that the log-transformed variables replace their untransformed counterparts. We thus still have a dataset with 12 variables, 5 of them log-transformed.
train <- train %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
train %>%
summary
## fixed_acidity volatile_acidity citric_acid density
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.9901
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.:0.9956
## Median : 7.900 Median :0.5200 Median :0.2600 Median :0.9968
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean :0.9968
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.:0.9979
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :1.0037
## alcohol quality pH Log_residual_sugar
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :-0.1054
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.: 0.6419
## Median :10.20 Median :6.000 Median :3.300 Median : 0.7885
## Mean :10.43 Mean :5.635 Mean :3.308 Mean : 0.8554
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.: 0.9555
## Max. :14.90 Max. :8.000 Max. :4.010 Max. : 2.7408
## Log_chlorides Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## Min. :-4.4228 Min. :0.000 Min. :1.792
## 1st Qu.:-2.6451 1st Qu.:1.946 1st Qu.:3.091
## Median :-2.5257 Median :2.639 Median :3.638
## Mean :-2.4980 Mean :2.545 Mean :3.604
## 3rd Qu.:-2.3969 3rd Qu.:3.045 3rd Qu.:4.127
## Max. :-0.4927 Max. :4.220 Max. :5.666
## Log_sulphates
## Min. :-1.1087
## 1st Qu.:-0.5798
## Median :-0.4780
## Mean :-0.4496
## 3rd Qu.:-0.3425
## Max. : 0.6678
Having applied the appropriate logarithmic transformations to the 5 variables that required them, we again look at the distributions of the full set of variables:
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
We analyze in detail the distribution of the output variable “quality”, which holds the wines’ quality scores between 0 and 10.
ggplot(data = train) + geom_bar(mapping = aes(x = quality, fill = as.factor(quality))) +
labs(title = "Histograma Calidad Vino")
table(train$quality)
##
## 3 4 5 6 7 8
## 7 38 552 513 156 13
prop.table(table(train$quality))
##
## 3 4 5 6 7 8
## 0.005473026 0.029710711 0.431587177 0.401094605 0.121970289 0.010164191
From the plot and the table we can see that most of the wines (about 83% of them) have quality scores of 5 or 6, on a scale from 0 to 10.
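The 83% figure can be computed directly from the frequency table; the counts below are those reported for our training split:

```r
# Sketch: share of wines rated 5 or 6, from the quality frequency table.
counts <- c(`3` = 7, `4` = 38, `5` = 552, `6` = 513, `7` = 156, `8` = 13)
share_5_6 <- sum(counts[c("5", "6")]) / sum(counts)
round(share_5_6, 3)  # 0.833, i.e. about 83%
```

With the real data this is simply `sum(prop.table(table(train$quality))[c("5", "6")])`.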
We check whether our variables have outliers, examine their central values and see their spread, using boxplots.
Boxplot of variable alcohol
BoxPlot_alcohol <- ggplot(train, aes(x = factor(quality), y = alcohol)) +
geom_boxplot() + geom_boxplot(fill = "#F8766D") + ggtitle("Boxplot alcohol")
BoxPlot_alcohol
We observe that wines with better “quality” scores generally have a higher alcohol percentage.
Boxplot of variable citric_acid
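The impression from the boxplot can be quantified with per-quality group means, for instance via aggregate(); a sketch on toy data (with the real split this would be `aggregate(alcohol ~ quality, data = train, FUN = mean)`):

```r
# Sketch: mean alcohol per quality score on a small toy data frame.
toy <- data.frame(quality = c(5, 5, 6, 6, 7, 7),
                  alcohol = c(9.5, 9.8, 10.2, 10.6, 11.3, 11.6))
aggregate(alcohol ~ quality, data = toy, FUN = mean)
##   quality alcohol
## 1       5    9.65
## 2       6   10.40
## 3       7   11.45
```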
BoxPlot_citric_acid <- ggplot(train, aes(x = factor(quality),
y = citric_acid)) + geom_boxplot() + geom_boxplot(fill = "#E58700") +
ggtitle("Boxplot citric_acid")
BoxPlot_citric_acid
We observe that wines with better “quality” scores generally contain more citric acid.
Boxplot of variable density
BoxPlot_density <- ggplot(train, aes(x = factor(quality), y = density)) +
geom_boxplot() + geom_boxplot(fill = "#C99800") + ggtitle("Boxplot density")
BoxPlot_density
We observe that wines with better “quality” scores generally have slightly lower density, but it is not a decisive variable for product quality.
Boxplot of variable fixed_acidity
BoxPlot_fixed_acidity <- ggplot(train, aes(x = factor(quality),
y = fixed_acidity)) + geom_boxplot() + geom_boxplot(fill = "#6BB100") +
ggtitle("Boxplot fixed_acidity")
BoxPlot_fixed_acidity
We observe that “fixed_acidity” remains fairly stable regardless of the final wine quality, with no large differences between quality levels.
Boxplot of variable Log_chlorides
BoxPlot_Log_chlorides <- ggplot(train, aes(x = factor(quality),
y = Log_chlorides)) + geom_boxplot() + geom_boxplot(fill = "#00BA38") +
ggtitle("Boxplot Log_chlorides")
BoxPlot_Log_chlorides
We observe that “Log_chlorides” remains fairly stable regardless of the final wine quality, with no large differences between quality levels.
Boxplot of variable Log_free_sulfur_dioxide
BoxPlot_Log_free_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
y = Log_free_sulfur_dioxide)) + geom_boxplot() + geom_boxplot(fill = "#00BF7D") +
ggtitle("Boxplot Log_free_sulfur_dioxide")
BoxPlot_Log_free_sulfur_dioxide
No specific trend is apparent in “Log_free_sulfur_dioxide” that would be decisive for wine quality.
Boxplot of variable Log_residual_sugar
BoxPlot_Log_residual_sugar <- ggplot(train, aes(x = factor(quality),
y = Log_residual_sugar)) + geom_boxplot() + geom_boxplot(fill = "#00C0AF") +
ggtitle("Boxplot Log_residual_sugar")
BoxPlot_Log_residual_sugar
We observe that “Log_residual_sugar” remains fairly stable regardless of the final wine quality, with no large differences between quality levels.
Boxplot of variable Log_sulphates
BoxPlot_Log_sulphates <- ggplot(train, aes(x = factor(quality),
y = Log_sulphates)) + geom_boxplot() + geom_boxplot(fill = "#00BCD8") +
ggtitle("Boxplot Log_sulphates")
BoxPlot_Log_sulphates
We observe that wines with better “quality” scores generally have higher “Log_sulphates” values, although there are quite a few outliers at scores of 5 and 6, which could be misleading.
Boxplot of variable Log_total_sulfur_dioxide
BoxPlot_Log_total_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
y = Log_total_sulfur_dioxide)) + geom_boxplot() + geom_boxplot(fill = "#00B0F6") +
ggtitle("Boxplot Log_total_sulfur_dioxide")
BoxPlot_Log_total_sulfur_dioxide
No specific trend is apparent in “Log_total_sulfur_dioxide” that would be decisive for wine quality.
Boxplot of variable pH
BoxPlot_pH <- ggplot(train, aes(x = factor(quality), y = pH)) +
geom_boxplot() + geom_boxplot(fill = "#B983FF") + ggtitle("Boxplot pH")
BoxPlot_pH
We observe that wines with better “quality” scores generally have a slightly lower pH, although there are numerous outliers among wines scored 5 and 6, which could be misleading.
Boxplot of variable volatile_acidity
BoxPlot_volatile_acidity <- ggplot(train, aes(x = factor(quality),
y = volatile_acidity)) + geom_boxplot() + geom_boxplot(fill = "#FF67A4") +
ggtitle("Boxplot volatile_acidity")
BoxPlot_volatile_acidity
We observe that wines with better “quality” scores generally contain less volatile acidity (“volatile_acidity”).
Continuing the analysis of the dataset’s variables and the study of how they relate to each other, we want a global view of the correlations between the numeric variables that may prove useful for the prediction model.
pairs(train)
corrplot(cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric)))
res <- cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric))
round(res, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## quality 0.14 -0.39 0.25 -0.16
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## alcohol quality pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 0.14 -0.64 0.20 0.16
## volatile_acidity -0.21 -0.39 0.22 0.04 0.09
## citric_acid 0.15 0.25 -0.49 0.19 0.16
## density -0.49 -0.16 -0.32 0.44 0.33
## alcohol 1.00 0.49 0.18 0.06 -0.29
## quality 0.49 1.00 -0.07 0.03 -0.16
## pH 0.18 -0.07 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 0.03 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.16 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 -0.03 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.16 -0.03 0.17 0.06
## Log_sulphates 0.13 0.33 -0.14 0.02 0.22
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## quality -0.03 -0.16
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## Log_sulphates
## fixed_acidity 0.19
## volatile_acidity -0.30
## citric_acid 0.32
## density 0.14
## alcohol 0.13
## quality 0.33
## pH -0.14
## Log_residual_sugar 0.02
## Log_chlorides 0.22
## Log_free_sulfur_dioxide 0.06
## Log_total_sulfur_dioxide 0.04
## Log_sulphates 1.00
We see that the variables most correlated with “quality” are: “volatile_acidity”, “citric_acid”, “alcohol” and “Log_sulphates”.
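This shortlist can also be extracted programmatically by ranking the correlation matrix row for quality by absolute value; a sketch with a small deterministic toy data frame standing in for our `res` matrix:

```r
# Sketch: rank predictors by |correlation with quality|.
toy <- data.frame(alcohol = 1:10,
                  noise   = rep(c(1, -1), 5),
                  quality = c(1, 2, 3, 4, 5, 6, 7, 8, 10, 9))
res_toy <- cor(toy)

# Drop quality's self-correlation, then sort the rest by magnitude
ranking <- sort(abs(res_toy["quality", colnames(res_toy) != "quality"]),
                decreasing = TRUE)
names(ranking)[1]  # "alcohol"
```

With the real data, `res_toy` would be the `res` matrix computed above, and the top entries of `ranking` reproduce the shortlist of most-correlated variables.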
We carry out a bivariate analysis to see which variables are most correlated with each other, positively or negatively.
Correlation: fixed_acidity and citric_acid:
cor(x = train$fixed_acidity, y = train$citric_acid)
## [1] 0.678372
train %>%
ggplot(aes(fixed_acidity, citric_acid)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y citric_acid",
x = "fixed_acidity", y = "citric_acid")
Correlation: fixed_acidity and density:
cor(x = train$fixed_acidity, y = train$density)
## [1] 0.6782196
train %>%
ggplot(aes(fixed_acidity, density)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y density",
x = "fixed_acidity", y = "density")
Correlation: fixed_acidity and pH:
cor(x = train$fixed_acidity, y = train$pH)
## [1] -0.644656
train %>%
ggplot(aes(fixed_acidity, pH)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y pH",
x = "fixed_acidity", y = "pH")
Correlation: citric_acid and volatile_acidity:
cor(x = train$citric_acid, y = train$volatile_acidity)
## [1] -0.5538307
train %>%
ggplot(aes(citric_acid, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables citric_acid y volatile_acidity",
x = "citric_acid", y = "volatile_acidity")
Correlation: citric_acid and pH:
cor(x = train$citric_acid, y = train$pH)
## [1] -0.4941459
train %>%
ggplot(aes(citric_acid, pH)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables citric_acid y pH",
x = "citric_acid", y = "pH")
Correlation: density and Log_residual_sugar:
cor(x = train$density, y = train$Log_residual_sugar)
## [1] 0.4399375
train %>%
ggplot(aes(density, Log_residual_sugar)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables density y Log_residual_sugar",
x = "density", y = "Log_residual_sugar")
Correlation: density and alcohol:
cor(x = train$density, y = train$alcohol)
## [1] -0.4880924
train %>%
ggplot(aes(density, alcohol)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables density y alcohol",
x = "density", y = "alcohol")
Correlation: quality and alcohol:
cor(x = train$quality, y = train$alcohol)
## [1] 0.4895963
train %>%
    ggplot(aes(quality, alcohol)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relación entre variables quality y alcohol",
    x = "quality", y = "alcohol")
Correlation: quality and volatile_acidity:
cor(x = train$quality, y = train$volatile_acidity)
## [1] -0.3904367
train %>%
ggplot(aes(quality, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables quality y volatile_acidity",
x = "quality", y = "volatile_acidity")
Correlation: Log_free_sulfur_dioxide and Log_total_sulfur_dioxide:
cor(x = train$Log_free_sulfur_dioxide, y = train$Log_total_sulfur_dioxide)
## [1] 0.7856495
train %>%
ggplot(aes(Log_free_sulfur_dioxide, Log_total_sulfur_dioxide)) +
geom_point(alpha = 0.2, colour = "green") + geom_smooth(formula = "y ~ x",
method = "lm") + labs(title = "Relación entre variables Log_free_sulfur_dioxide y Log_total_sulfur_dioxide",
x = "Log_free_sulfur_dioxide", y = "Log_total_sulfur_dioxide")
We now apply to the test dataset the same changes and transformations previously applied to our train dataset.
We impute the NAs in the test dataset:
test$pH[is.na(test$pH)] <- median(test$pH, na.rm = TRUE)
test$sulphates[is.na(test$sulphates)] <- median(test$sulphates,
na.rm = TRUE)
summary(test)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.70 Min. :0.1600 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.080 1st Qu.: 1.900
## Median : 7.80 Median :0.5200 Median :0.250 Median : 2.150
## Mean : 8.17 Mean :0.5341 Mean :0.262 Mean : 2.486
## 3rd Qu.: 9.00 3rd Qu.:0.6600 3rd Qu.:0.420 3rd Qu.: 2.525
## Max. :15.00 Max. :1.2400 Max. :1.000 Max. :13.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 3.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.06800 1st Qu.: 7.00 1st Qu.: 20.00 1st Qu.:0.9957
## Median :0.07800 Median :14.00 Median : 37.00 Median :0.9967
## Mean :0.08452 Mean :15.95 Mean : 46.58 Mean :0.9967
## 3rd Qu.:0.08725 3rd Qu.:22.00 3rd Qu.: 63.25 3rd Qu.:0.9977
## Max. :0.61000 Max. :72.00 Max. :160.00 Max. :1.0024
## alcohol quality pH sulphates
## Min. : 8.80 Min. :3.000 Min. :2.740 Min. :0.3900
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.210 1st Qu.:0.5500
## Median :10.10 Median :6.000 Median :3.320 Median :0.6200
## Mean :10.39 Mean :5.641 Mean :3.315 Mean :0.6583
## 3rd Qu.:11.00 3rd Qu.:6.000 3rd Qu.:3.400 3rd Qu.:0.7100
## Max. :14.00 Max. :8.000 Max. :3.850 Max. :2.0000
We apply the same logarithmic transformations used on the train set to the corresponding test variables:
test <- test %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
We modify our test dataset so that the log-transformed variables replace their untransformed counterparts. We thus still have a dataset with 12 variables, 5 of them log-transformed.
test <- test %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
head(test)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol quality pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 5 3.51
## 2 7.3 0.65 0 0.995 10 7 3.39
## 3 8.9 0.22 0.48 0.997 9.4 6 3.39
## 4 7.6 0.41 0.24 0.996 9.5 5 3.28
## 5 7.1 0.71 0 0.997 9.4 5 3.47
## 6 5.7 1.13 0.09 0.994 9.8 4 3.5
## # ... with 5 more variables: Log_residual_sugar <dbl>, Log_chlorides <dbl>,
## # Log_free_sulfur_dioxide <dbl>, Log_total_sulfur_dioxide <dbl>,
## # Log_sulphates <dbl>
Having analyzed our dataset in depth, and having understood and transformed our variables, we will fit a multiple linear regression model to try to predict the quality of red “Vinho Verde” wine.
We fit a multiple linear regression model to predict the value of the quality variable from the following independent variables (we take all variables except “Log_residual_sugar”, which shows no correlation with “quality”), selected on the basis of the correlation analyses seen above.
modelo = lm(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
summary(modelo)
##
## Call:
## lm(formula = quality ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27885 -0.34712 -0.05254 0.43254 1.89518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.94814 18.67185 -0.533 0.594274
## alcohol 0.30374 0.02413 12.587 < 2e-16 ***
## fixed_acidity 0.01981 0.02396 0.827 0.408630
## volatile_acidity -1.04091 0.12864 -8.092 1.37e-15 ***
## citric_acid -0.31703 0.15340 -2.067 0.038963 *
## Log_chlorides -0.22535 0.06205 -3.632 0.000293 ***
## Log_total_sulfur_dioxide -0.15705 0.04408 -3.563 0.000380 ***
## Log_free_sulfur_dioxide 0.12800 0.04305 2.973 0.003001 **
## density 14.65001 19.00095 0.771 0.440842
## pH -0.50069 0.18785 -2.665 0.007788 **
## Log_sulphates 0.85751 0.09545 8.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6241 on 1268 degrees of freedom
## Multiple R-squared: 0.3862, Adjusted R-squared: 0.3813
## F-statistic: 79.77 on 10 and 1268 DF, p-value: < 2.2e-16
For variable selection we use automatic stepwise selection.
empty.model <- lm(quality ~ 1, data = train)
horizonte <- formula(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates)
# stepwise selection (both directions); reports which
# variables are significant
seleccion = stepAIC(empty.model, direction = c("both"), trace = FALSE,
scope = horizonte)
seleccion$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## quality ~ 1
##
## Final Model:
## quality ~ alcohol + volatile_acidity + Log_sulphates + Log_chlorides +
## pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1278 804.4848 -590.9851
## 2 + alcohol 1 192.838647 1277 611.6461 -939.4926
## 3 + volatile_acidity 1 69.859784 1276 541.7863 -1092.6125
## 4 + Log_sulphates 1 29.717001 1275 512.0693 -1162.7631
## 5 + Log_chlorides 1 4.327006 1274 507.7423 -1171.6166
## 6 + pH 1 5.098410 1273 502.6439 -1182.5244
## 7 + Log_total_sulfur_dioxide 1 2.458061 1272 500.1858 -1186.7944
## 8 + Log_free_sulfur_dioxide 1 3.993776 1271 496.1921 -1195.0476
We inspect the information for the model chosen as the “best”:
summary(seleccion)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24223 -0.35766 -0.05925 0.43097 1.88984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97311 0.44122 11.271 < 2e-16 ***
## alcohol 0.28289 0.01841 15.366 < 2e-16 ***
## volatile_acidity -0.90819 0.10940 -8.301 2.60e-16 ***
## Log_sulphates 0.86434 0.09387 9.208 < 2e-16 ***
## Log_chlorides -0.23817 0.05970 -3.989 7.00e-05 ***
## pH -0.52666 0.13211 -3.987 7.08e-05 ***
## Log_total_sulfur_dioxide -0.17181 0.04226 -4.065 5.10e-05 ***
## Log_free_sulfur_dioxide 0.13511 0.04224 3.198 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6248 on 1271 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3798
## F-statistic: 112.8 on 7 and 1271 DF, p-value: < 2.2e-16
We keep as the final regression model a variant of the model selected above, in which Log_free_sulfur_dioxide is replaced by citric_acid and fixed_acidity.
mejor_modelo = lm(quality ~ alcohol + volatile_acidity + Log_sulphates +
Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
fixed_acidity, data = train)
summary(mejor_modelo)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
## fixed_acidity, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31169 -0.35282 -0.05415 0.42911 1.86472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.27164 0.63721 6.704 3.05e-11 ***
## alcohol 0.30000 0.01884 15.921 < 2e-16 ***
## volatile_acidity -1.07446 0.12691 -8.466 < 2e-16 ***
## Log_sulphates 0.88622 0.09458 9.370 < 2e-16 ***
## Log_chlorides -0.21728 0.06080 -3.574 0.000365 ***
## pH -0.40708 0.16560 -2.458 0.014097 *
## Log_total_sulfur_dioxide -0.05076 0.02651 -1.915 0.055768 .
## citric_acid -0.38575 0.15186 -2.540 0.011197 *
## fixed_acidity 0.03469 0.01610 2.155 0.031357 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6258 on 1270 degrees of freedom
## Multiple R-squared: 0.3818, Adjusted R-squared: 0.3779
## F-statistic: 98.03 on 8 and 1270 DF, p-value: < 2.2e-16
We compute the 95% confidence intervals for the fitted values of our observations.
intervalos = predict(mejor_modelo, interval = "confidence", level = 0.95)
head(intervalos)
## fit lwr upr
## 1 5.545952 5.455099 5.636806
## 2 5.095579 5.013600 5.177558
## 3 4.981789 4.813456 5.150122
## 4 5.381935 5.312809 5.451061
## 5 5.445889 5.371787 5.519992
## 6 4.747729 4.589377 4.906082
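For context, the difference between the two interval types can be sketched on a built-in dataset (mtcars, not the wine data): interval = "confidence" bounds the mean response, while interval = "prediction" bounds a single new observation and is therefore always wider.

```r
# Sketch on mtcars: confidence vs. prediction intervals for one new point.
fit   <- lm(mpg ~ wt, data = mtcars)
nuevo <- data.frame(wt = 3)
ic <- predict(fit, nuevo, interval = "confidence", level = 0.95)
ip <- predict(fit, nuevo, interval = "prediction", level = 0.95)
ic
ip
# The prediction interval is wider than the confidence interval:
(ip[, "upr"] - ip[, "lwr"]) > (ic[, "upr"] - ic[, "lwr"])
```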
The ANOVA table shows the significance of the regression terms:
anova = aov(mejor_modelo)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 192.8 192.84 492.415 < 2e-16 ***
## volatile_acidity 1 69.9 69.86 178.387 < 2e-16 ***
## Log_sulphates 1 29.7 29.72 75.883 < 2e-16 ***
## Log_chlorides 1 4.3 4.33 11.049 0.000913 ***
## pH 1 5.1 5.10 13.019 0.000320 ***
## Log_total_sulfur_dioxide 1 2.5 2.46 6.277 0.012358 *
## citric_acid 1 1.0 1.01 2.584 0.108223
## fixed_acidity 1 1.8 1.82 4.644 0.031357 *
## Residuals 1270 497.4 0.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(mejor_modelo)
## alcohol volatile_acidity Log_sulphates
## 1.326586 1.667617 1.236976
## Log_chlorides pH Log_total_sulfur_dioxide
## 1.301107 1.887347 1.125057
## citric_acid fixed_acidity
## 2.892183 2.660426
mean(vif(mejor_modelo))
## [1] 1.762162
As a rule of thumb, a VIF above 4 (or a tolerance below 0.25) suggests possible multicollinearity (strong correlation among the model’s explanatory variables) and warrants further investigation. When the VIF exceeds 10 (or the tolerance falls below 0.1), there is serious multicollinearity that must be corrected. In this case no multicollinearity is observed.
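The quantity that vif() reports can be reproduced by hand; a sketch on mtcars (not the wine data): the VIF of a predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```r
# Manual VIF for wt in the model mpg ~ wt + hp + disp:
# regress wt on the other predictors and invert (1 - R^2).
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # a value above 4 would hint at multicollinearity
```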
mean(mejor_modelo$residuals)
## [1] -1.419695e-17
# base-graphics version of the diagnostic plot
plot(mejor_modelo, 1)
# alternative version showing the same plot (ggplot2)
autoplot(mejor_modelo, 1)
In the Residuals vs. Fitted plot we see that the mean of the residuals is close to zero (though not uniformly so), so the linearity assumption is not violated in principle. However, since the dependent variable “quality” is discrete, an ordinary linear regression model does not fit our data well.
We first check the normality of the residuals. Since the Shapiro-Wilk test only accepts the first 5000 residual samples, we also run the Anderson-Darling test to compare results.
Shapiro-Wilk:
# the raw residuals could be taken with resid(mejor_modelo);
# here we use the studentized residuals
muestras_residuos1 = studres(mejor_modelo)
residual_norm = shapiro.test(muestras_residuos1[0:5000])
residual_norm
##
## Shapiro-Wilk normality test
##
## data: muestras_residuos1[0:5000]
## W = 0.98969, p-value = 7.913e-08
Anderson-Darling:
# install.packages('nortest')
residual_anderson = ad.test(muestras_residuos1)
residual_anderson
##
## Anderson-Darling normality test
##
## data: muestras_residuos1
## A = 4.0651, p-value = 4.078e-10
This normality assumption can also be checked graphically: as the plot shows, our data pull away from the reference line in the tails, which may indicate that the residuals do not follow a normal distribution.
# these three plots convey the same information
plot(mejor_modelo, 2)
autoplot(mejor_modelo, 2)
hist(muestras_residuos1, freq = FALSE, main = "Distribución de los residuos estandarizados")
xfit <- seq(min(muestras_residuos1), max(muestras_residuos1),
length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
The Q-Q plot shows that the residuals follow, or at least approximate, a normal distribution. We can therefore assume that the coefficient estimators are normally distributed.
We now check homoscedasticity (that the residuals have constant variance).
As the results show, the p-value < 0.05, so the null hypothesis is rejected: the variance is not constant for this linear regression model (there is heteroscedasticity, which is a problem). We can conclude that this mathematical model is not adequate.
# https://fhernanb.github.io/libro_regresion/homo.html
# another test for checking homoscedasticity
ncvTest(mejor_modelo)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 17.5856, Df = 1, p = 2.7466e-05
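The same idea can be checked informally: if the spread of the residuals grows with the fitted values, their absolute values correlate with the fits. A sketch on synthetic heteroscedastic data (not our model):

```r
# Synthetic data whose error spread grows with x: the absolute residuals
# of a linear fit then correlate positively with the fitted values.
set.seed(42)
x <- runif(500)
y <- 2 * x + rnorm(500, sd = 0.1 + 2 * x)  # variance increases with x
fit <- lm(y ~ x)
cor.test(abs(resid(fit)), fitted(fit), method = "spearman")
```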
Homoscedasticity can also be checked graphically; ideally the red line should be as straight and horizontal as possible.
plot(mejor_modelo, 3)
autoplot(mejor_modelo, 3)
As the results show, the p-value > 0.05, so we accept the H0 of independence.
dwtest(mejor_modelo)
##
## Durbin-Watson test
##
## data: mejor_modelo
## DW = 2.0088, p-value = 0.563
## alternative hypothesis: true autocorrelation is greater than 0
Residual independence can also be checked graphically: no unusual patterns are visible, which suggests the residuals are independent and show no autocorrelation.
plot(mejor_modelo$resid)
acf(mejor_modelo$residuals)
For principal component analysis we take every variable in our dataset except “quality”, which is the one we want to predict.
prcomp_train <- prcomp(train[, -6])
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7932919537 1.1139049116 0.8715249896 0.3582166080 0.3099346034
## [6] 0.3016065686 0.2104320866 0.1621592924 0.1042111183 0.0987024931
## [11] 0.0007134006
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.987707565 -0.0103012861 -1.022231e-01 0.062459844
## volatile_acidity -0.025573618 -0.0335218329 2.264482e-02 -0.079018774
## citric_acid 0.074203260 0.0250265448 -3.742426e-02 -0.046522512
## density 0.000724006 -0.0007910339 8.763104e-05 -0.001630274
## alcohol -0.033238009 0.9069893906 -3.959749e-01 -0.033337688
## pH -0.052371326 0.0213916196 4.913437e-03 0.011180103
## Log_residual_sugar 0.039855973 -0.0021044752 -1.037845e-01 -0.826895669
## Log_chlorides 0.031702071 -0.0848605722 3.174456e-02 -0.391082136
## Log_free_sulfur_dioxide -0.086908780 -0.2361937132 -6.575105e-01 0.323215453
## Log_total_sulfur_dioxide -0.061636505 -0.3343377688 -6.206551e-01 -0.211995173
## Log_sulphates 0.021906011 0.0167601084 -3.879522e-02 -0.018426525
## PC5 PC6 PC7 PC8
## fixed_acidity -0.0230198704 -0.0059651163 -0.064939449 -0.039228108
## volatile_acidity -0.0390618557 -0.0303160677 -0.597211835 -0.541189128
## citric_acid 0.0729849365 0.0909937418 0.381491021 0.409952831
## density -0.0002677616 -0.0007711058 0.000185434 -0.001576883
## alcohol 0.0565568851 0.0881221105 -0.082562820 -0.001514142
## pH -0.0569108452 -0.0518242796 -0.038186548 -0.140386718
## Log_residual_sugar -0.2808810688 -0.4555153363 0.125710162 -0.007655261
## Log_chlorides 0.8611890499 0.1118830637 -0.226520421 0.140520863
## Log_free_sulfur_dioxide 0.2215346137 -0.5890496504 -0.021993993 0.034613521
## Log_total_sulfur_dioxide -0.1995578676 0.6396298085 -0.031382301 -0.022315853
## Log_sulphates 0.2770165639 0.0647750830 0.645544341 -0.704485728
## PC9 PC10 PC11
## fixed_acidity 0.041080555 -0.0443511620 0.0008574666
## volatile_acidity -0.081317870 0.5769403933 0.0004106125
## citric_acid 0.118259173 0.8056760674 -0.0002945593
## density 0.004180642 -0.0002855141 -0.9999877190
## alcohol -0.015362480 -0.0183878205 -0.0008768774
## pH 0.983885416 -0.0388597280 0.0043212789
## Log_residual_sugar -0.025609909 -0.0311525353 0.0017331842
## Log_chlorides 0.071354850 -0.0867580300 0.0004730618
## Log_free_sulfur_dioxide -0.012248313 0.0445217874 -0.0001883117
## Log_total_sulfur_dioxide 0.025005306 -0.0566273254 0.0002213595
## Log_sulphates -0.053924917 0.0228971577 0.0009037505
The standard deviations are the square roots of the eigenvalues of the covariance matrix and represent the variability captured by each component: the larger the value, the more relevant the corresponding component. To visualize the relative importance of each component:
plot(prcomp_train)
Numerically:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7933 1.1139 0.8715 0.35822 0.30993 0.30161 0.21043
## Proportion of Variance 0.5719 0.2207 0.1351 0.02282 0.01708 0.01618 0.00788
## Cumulative Proportion 0.5719 0.7926 0.9277 0.95052 0.96761 0.98378 0.99166
## PC8 PC9 PC10 PC11
## Standard deviation 0.16216 0.10421 0.09870 0.0007134
## Proportion of Variance 0.00468 0.00193 0.00173 0.0000000
## Cumulative Proportion 0.99634 0.99827 1.00000 1.0000000
To avoid a variable dominating the analysis and becoming more influential merely because of its larger magnitude, the data should be standardized:
prcomp_train <- prcomp(train[, -6], center = TRUE, scale. = TRUE)
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
Numerically as well:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
We examine the variances and the components more graphically:
prcomp_train.var <- prcomp_train$sdev^2
prcomp_train.pvar <- prcomp_train.var/sum(prcomp_train.var)
plot(cumsum(prcomp_train.pvar), xlab = "components", ylab = "cumulative variance",
ylim = c(0, 1), type = "b")
grid()
abline(h = 0.95, col = "blue")
plot(prcomp_train, type = "l", main = "Variance explained by PCA")
fviz_screeplot(prcomp_train, addlabels = TRUE)
As we can see, the first two components (PC1 and PC2) capture only 47.10% of the variability. Adding the third (PC1, PC2 and PC3) raises the figure to 61.63%. This means a plot of the wine data projected onto the first two or three principal components will not be very representative. The cumulative-variance plot also shows that the first 8 PCs are needed to cover 95% of the dataset’s variance. Reducing 11 variables to just 8 PCs is hard to justify, given the loss of interpretability with respect to the original variables that this implies.
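The choice of “first k components covering 95% of the variance” can also be made programmatically; a sketch on the built-in USArrests data (not the wine dataset):

```r
# Standardized PCA on USArrests, then the index of the first component
# whose cumulative proportion of variance reaches the 95% threshold.
pca  <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pvar <- pca$sdev^2 / sum(pca$sdev^2)
n_pc <- which(cumsum(pvar) >= 0.95)[1]
n_pc
```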
We plot the data projected onto the first two components:
ggplot(as.data.frame(prcomp(train[, -6], scale = T)$x[, 1:2]),
aes(x = PC1, y = PC2, label = rownames(train))) + geom_point() +
geom_text(hjust = 0, vjust = 0)
We incorporate the variable information using the “biplot” technique:
ggbiplot(prcomp(train[, -6], labels = rownames(train), scale = T))
ggbiplot(prcomp(train[, -6], scale = T), ellipse = TRUE, labels = rownames(train),
groups = train$quality)
train_fquality <- train %>%
mutate(quality = as.factor(quality))
ggbiplot(prcomp_train, obs.scale = 1, var.scale = 1, groups = train_fquality$quality,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
We see that an analysis with only 2 components is not optimal, since on their own they do not explain a high percentage of the variance. Even so, at an exploratory level some pattern among the possible groups can be made out: broadly, the lower a point sits in the plot, the better the wine’s score (blue and pink points correspond to scores near 7 and 8), and the higher it sits, the worse the score (green, yellow and red points correspond to scores of 5 and below).
We extend the analysis to the first 4 principal components to try to identify clearer groupings in the data.
colores <- function(vec) {
# rainbow() returns a vector with one distinct colour
# per unique value
col <- rainbow(length(unique(vec)))
return(col[as.numeric(as.factor(vec))])
}
par(mfrow = c(1, 3))
# observations on PC1 and PC2
plot(prcomp_train$x[, 1:2], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC2")
# observations on PC1 and PC3
plot(prcomp_train$x[, c(1, 3)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC3")
# observations on PC1 and PC4
plot(prcomp_train$x[, c(1, 4)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC4")
Extending the analysis to the third and fourth PCs adds very little: no clear or noteworthy groupings emerge between the groups. This is because even the 4 dimensions of the first 4 PCs together would explain barely 71.52% of the variance in the data.
In our train dataset we create the binary variable “nota_vino”, which flags, based on “quality”, the wines that pass (quality >= 6, labelled “aprobado”) or fail (quality < 6, labelled “suspenso”).
train_pca <- train[, colnames(train) != "quality"]
train_pca$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_pca)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_pca$nota_vino)
##
## aprobado suspenso
## 682 597
train_pca
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
We now repeat the Principal Component Analysis as before:
prcomp_train_2 <- prcomp(train_pca[, -12], center = TRUE, scale. = TRUE)
prcomp_train_2
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
summary(prcomp_train_2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
ggbiplot(prcomp_train_2, obs.scale = 1, var.scale = 1, groups = train_pca$nota_vino,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
We see that the results obtained are the same, with no improvement. This view of the data does, however, make the earlier observation clearer: points towards the bottom of the plot generally correspond to “aprobado” wines (pinkish points, scores of 6 or above) and points towards the top to “suspenso” wines (bluish points, scores below 6). Beyond that, with only 47.10% of the variance explained by the first 2 PCs, no further patterns or conclusions can be seen in the data.
We now attempt dimensionality reduction with methods that, unlike PCA, are non-linear. The t-SNE algorithm can separate data that cannot be split by a single straight line, i.e. it can handle data that are not linearly separable. It can help us understand high-dimensional data by projecting it down to just 2 or 3 dimensions.
tsne_train <- (train[, -6])
tsne_train
## # A tibble: 1,279 x 11
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
The algorithm builds a probability distribution that represents the similarities between neighbours, grouping them while reducing the dimensionality.
set.seed(3)
tsne_data <- tsne_train[, 1:11]
tsne <- Rtsne(tsne_data, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_fquality$quality, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
As with PCA, the results are not satisfactory, so applying these techniques to our dataset is not advisable.
In our train dataset we create the binary variable “nota_vino”, which flags, based on “quality”, the wines that pass (quality >= 6, labelled “aprobado”) or fail (quality < 6, labelled “suspenso”).
train_tsne <- train[, colnames(train) != "quality"]
train_tsne$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_tsne)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_tsne$nota_vino)
##
## aprobado suspenso
## 682 597
train_tsne
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
set.seed(3)
tsne_data_2 <- train_tsne[, 1:11]
tsne_2 <- Rtsne(tsne_data_2, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne_2$Y, col = "black", bg = train_tsne$nota_vino, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
Binarizing the response variable does not clarify much either; dimensionality reduction does not seem applicable to our data.
First of all, we create the binary variable “nota_vino”, which, based on “quality”, marks wines as passed (quality >= 6, coded “1”) or failed (quality < 6, coded “0”).
train_glm <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_glm
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_glm$nota_vino)
##
## 0 1
## 597 682
str(train_glm)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
Making this distinction between “passed” and “failed” wines, we see that the distribution between the two groups is fairly balanced, with 597 failed and 682 passed wines in the training data.
Next, we look at the correlations and the behaviour of the variables with respect to this newly created categorical variable:
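The same balance can be read as class proportions with a one-line check (a sketch using the counts printed above, so it runs standalone):

```r
# Class balance as proportions rather than raw counts
counts <- c(`0` = 597, `1` = 682)   # from table(train_glm$nota_vino) above
round(prop.table(counts), 3)
# -> 0: 0.467, 1: 0.533
```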
c <- cor(train_glm)
corrplot(c)
We show the correlations numerically:
round(c, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## nota_vino 0.11 -0.32 0.18 -0.15
## alcohol pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 -0.64 0.20 0.16
## volatile_acidity -0.21 0.22 0.04 0.09
## citric_acid 0.15 -0.49 0.19 0.16
## density -0.49 -0.32 0.44 0.33
## alcohol 1.00 0.18 0.06 -0.29
## pH 0.18 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.03 0.17 0.06
## Log_sulphates 0.13 -0.14 0.02 0.22
## nota_vino 0.45 -0.01 0.00 -0.14
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## nota_vino -0.06 -0.20
## Log_sulphates nota_vino
## fixed_acidity 0.19 0.11
## volatile_acidity -0.30 -0.32
## citric_acid 0.32 0.18
## density 0.14 -0.15
## alcohol 0.13 0.45
## pH -0.14 -0.01
## Log_residual_sugar 0.02 0.00
## Log_chlorides 0.22 -0.14
## Log_free_sulfur_dioxide 0.06 -0.06
## Log_total_sulfur_dioxide 0.04 -0.20
## Log_sulphates 1.00 0.28
## nota_vino 0.28 1.00
We analyse the variables in a bivariate fashion:
# nota_vino vs alcohol
train_glm %>%
ggplot(aes(x = alcohol, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_sulphates
train_glm %>%
ggplot(aes(x = Log_sulphates, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs volatile_acidity
train_glm %>%
ggplot(aes(x = volatile_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs density
train_glm %>%
ggplot(aes(x = density, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs citric_acid
train_glm %>%
ggplot(aes(x = citric_acid, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_total_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_total_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
In general terms we see that the wines in the passing category have a higher value of “alcohol”, a slightly higher value of “Log_sulphates”, a lower “volatile_acidity”, a slightly lower “density”, a higher “citric_acid” and a lower value of “Log_total_sulfur_dioxide”.
# nota_vino vs fixed_acidity
train_glm %>%
ggplot(aes(x = fixed_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_free_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_free_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_residual_sugar
train_glm %>%
ggplot(aes(x = Log_residual_sugar, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs pH
train_glm %>%
ggplot(aes(x = pH, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_chlorides
train_glm %>%
ggplot(aes(x = Log_chlorides, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
For the variables “Log_chlorides”, “pH”, “Log_residual_sugar”, “Log_free_sulfur_dioxide” and “fixed_acidity”, it is harder to tell passed from failed wines in the density plots, since these characteristics are not distinctive of either group.
We fit a logistic regression model on the variables of our dataset to serve as a predictor of the binary variable we created.
modelo_glm = glm(nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train_glm,
family = binomial)
modelo_glm
##
## Call: glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Coefficients:
## (Intercept) alcohol fixed_acidity
## -24.7457 0.9700 0.1753
## volatile_acidity citric_acid Log_chlorides
## -3.2462 -1.7627 -0.4797
## Log_total_sulfur_dioxide Log_free_sulfur_dioxide density
## -0.6740 0.4413 17.5474
## pH Log_sulphates
## -0.1818 2.6310
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1268 Residual
## Null Deviance: 1767
## Residual Deviance: 1300 AIC: 1322
summary(modelo_glm)
##
## Call:
## glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3178 -0.8232 0.3053 0.7861 2.4429
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.74575 73.63188 -0.336 0.73682
## alcohol 0.96996 0.10386 9.340 < 2e-16 ***
## fixed_acidity 0.17526 0.09475 1.850 0.06436 .
## volatile_acidity -3.24621 0.54263 -5.982 2.20e-09 ***
## citric_acid -1.76266 0.60585 -2.909 0.00362 **
## Log_chlorides -0.47972 0.24746 -1.939 0.05256 .
## Log_total_sulfur_dioxide -0.67404 0.17254 -3.907 9.36e-05 ***
## Log_free_sulfur_dioxide 0.44134 0.16690 2.644 0.00818 **
## density 17.54736 74.95795 0.234 0.81491
## pH -0.18179 0.73995 -0.246 0.80593
## Log_sulphates 2.63103 0.39252 6.703 2.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1300.2 on 1268 degrees of freedom
## AIC: 1322.2
##
## Number of Fisher Scoring iterations: 4
As we can see, we keep only the significant variables that really affect “nota_vino” and build a new model exclusively with them (“Log_sulphates”, “Log_total_sulfur_dioxide”, “volatile_acidity” and “alcohol”). In this way we simplify the model, retain only the variables that truly matter for prediction, and obtain the best possible logistic regression model for our data.
modelo_glm2 = glm(nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
Log_total_sulfur_dioxide, data = train_glm, family = binomial)
modelo_glm2
##
## Call: glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Coefficients:
## (Intercept) alcohol volatile_acidity
## -5.9714 0.9694 -2.7563
## Log_sulphates Log_total_sulfur_dioxide
## 2.2830 -0.3912
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1274 Residual
## Null Deviance: 1767
## Residual Deviance: 1329 AIC: 1339
summary(modelo_glm2)
##
## Call:
## glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
To interpret the coefficients:
round(exp(cbind(Estimate = coef(modelo_glm2), confint(modelo_glm2))),
2)
## Estimate 2.5 % 97.5 %
## (Intercept) 0.00 0.00 0.02
## alcohol 2.64 2.27 3.09
## volatile_acidity 0.06 0.03 0.14
## Log_sulphates 9.81 4.97 19.79
## Log_total_sulfur_dioxide 0.68 0.55 0.83
The confidence intervals are not based on a Wald test (as in traditional regression) but on profiling of the log-likelihood, which is more accurate.
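As a sanity check on the table above, an odds ratio is just the exponentiated coefficient; a minimal sketch (the Wald-style call is shown only for comparison and assumes `modelo_glm2` from above):

```r
# Odds ratio for alcohol, from its coefficient in the summary above:
# each extra unit of alcohol multiplies the odds of passing by ~2.64
round(exp(0.96938), 2)
# -> 2.64
# Wald-style intervals, for comparison with the profiled ones:
# confint.default(modelo_glm2)   # estimate +/- z * std. error
```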
Predicted values of the model (on the linear-predictor scale):
head(predict(modelo_glm2))
## 1 2 3 4 5 6
## 0.1559899 -1.5792806 -1.4338215 -0.6054977 -0.6622377 -1.5725449
Probabilities on the response scale:
head(predict(modelo_glm2, type = "response"))
## 1 2 3 4 5 6
## 0.5389186 0.1708974 0.1925039 0.3530869 0.3402371 0.1718539
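The link between the two scales is the inverse logit, p = 1/(1 + exp(-η)); a quick check on the first observation above:

```r
# Inverse logit maps the linear predictor to a probability
eta <- 0.1559899        # first value of predict(modelo_glm2) above
p <- 1 / (1 + exp(-eta))
round(p, 7)             # ~0.5389186, matching type = "response"
```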
Evaluation of the predictive performance of the GLM model on the training data:
train_glm$y_pred_probs <- predict(modelo_glm2, train_glm, type = "response")
train_glm$y_pred <- ifelse(train_glm$y_pred_probs > 0.5, 1, 0)
# train_glm$y_pred_probs train_glm$y_pred
cm_train <- confusionMatrix(as.factor(train_glm$y_pred), as.factor(train_glm$nota_vino),
positive = "1")
cm_train$table
## Reference
## Prediction 0 1
## 0 445 172
## 1 152 510
# result
cm_train$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.75
cm_train$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
Looking at the metrics obtained, Accuracy (correct predictions / total predictions) stands at 75%, Precision (true positives / (true positives + false positives)) at 77%, and Recall or Sensitivity (true positives / (true positives + false negatives)) at 75%.
From these figures we conclude that, in about 75% of cases, the model will correctly predict whether a wine passes because it is reasonably good (quality >= 6) or fails because it is genuinely poor (quality < 6).
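The three metrics can be recomputed by hand from the confusion matrix above (positive class “1”, so TP = 510, TN = 445, FP = 152, FN = 172):

```r
# Metrics from the 2x2 confusion matrix printed above
tp <- 510; tn <- 445; fp <- 152; fn <- 172
accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
round(c(accuracy = accuracy, precision = precision, recall = recall), 2)
# -> accuracy 0.75, precision 0.77, recall 0.75
```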
We now try to apply cross-validation to the GLM model and perform a hyperparameter search.
We first check which parameters the model exposes for tuning. As can be seen, the GLM model has no tunable hyperparameters.
## https://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
## https://daviddalpiaz.github.io/r4sl/the-caret-package.html#classification
# Check which hyperparameters can be tuned
modelLookup("glm")
## model parameter label forReg forClass probModel
## 1 glm parameter parameter TRUE TRUE TRUE
We build the model with the variables selected as relevant, using 5-fold cross-validation on the training dataset.
caret.glm <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity + Log_sulphates + Log_total_sulfur_dioxide,
method = "glm",
family = "binomial",
data = train_glm,
trControl = trainControl(method = "cv", number = 5))
caret.glm
## Generalized Linear Model
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1022, 1023
## Resampling results:
##
## Accuracy Kappa
## 0.7482917 0.4956541
summary(caret.glm)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
From these figures we conclude that, in about 74.83% of cases, the model will correctly predict whether a wine passes because it is reasonably good (quality >= 6) or fails because it is genuinely poor (quality < 6).
confusionMatrix(caret.glm)
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 35.1 13.6
## 1 11.6 39.7
##
## Accuracy (average) : 0.7482
Evaluation of the predictive performance of the cross-validated GLM model on the training data:
train_glm$y_pred_probs2 <- predict(caret.glm, train_glm, type = "prob")[, "1"]
train_glm$y_pred2 <- ifelse(train_glm$y_pred_probs2 > 0.5, 1, 0)
# train_glm$y_pred_probs2
# train_glm$y_pred2
# plot.roc(as.numeric(train_glm$nota_vino),
# as.numeric(train_glm$y_pred_probs2))
We plot the ROC curve for the final GLM model obtained:
glm_roc <- plot.roc(train_glm$nota_vino, train_glm$y_pred_probs)
table(pred = train_glm$y_pred_probs > 0.5, obs = train_glm$nota_vino)
## obs
## pred 0 1
## FALSE 445 172
## TRUE 152 510
auc(glm_roc)
## Area under the curve: 0.8195
In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with a passing grade (quality >= 6) or a failing grade (quality < 6).
train_knn <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_knn
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_knn$nota_vino)
##
## 0 1
## 597 682
str(train_knn)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# REFERENCE: https://www.edureka.co/blog/knn-algorithm-in-r/
# train_knn <- train[, colnames(train) != 'quality']
# train_knn$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado','suspenso')) # levels = c('FALSE', 'TRUE')
# train_knn table(train_knn$nota_vino)
# str(train_knn)
First of all we compute the number of observations in our training dataset, to get an initial idea of the number of candidate neighbours “K” available in the training set, from which we will later derive the optimal value of “K”.
NROW <- NROW(train_knn)
NROW
## [1] 1279
To obtain an approximate optimal value of “K” we take the square root of the total number of observations in the training dataset:
sqrt(1279)
## [1] 35.76311
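This rule of thumb can be wrapped in a small helper; an odd K is usually preferred for binary classification so that majority votes cannot tie (a sketch, not taken from the report itself):

```r
# Square-root rule of thumb for the starting K, forced to be odd
k_start <- floor(sqrt(1279))                    # 35
if (k_start %% 2 == 0) k_start <- k_start + 1   # odd K avoids tied votes
k_start
# -> 35
```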
To try to improve the precision metrics of the KNN model we perform cross-validation:
#set.seed(22222220)
#train_knn_nor <- as.data.frame(scale(train_knn))
#train_z <- train_knn_nor[1:1000,]
#prueba_z <- train_knn_nor[1001:1279,]
#entrenamiento_labels <- train_knn_nor[1:1000,12]
#train_knn_labels <- train_knn[,12]
#knn_cv <- knn.cv(train_knn_nor[,1:11], cl= as.factor(train_knn$nota_vino), k=35)
#knn <- knn(train_z, prueba_z, cl= as.factor(entrenamiento_labels), k=35)
#table(knn_cv, as.factor(train_knn$nota_vino))
#table(knn, as.factor(train_knn$nota_vino))
#accuracy = sum(knn_cv == train_knn$nota_vino) /nrow(train_knn_nor)
#error = 1-accuracy
#accuracy
We are now working only on TRAIN data, so we can use these techniques to optimise the result, always taking care not to fall into the trap of overfitting.
#long = 50
#accuracy = rep(0,long)
#f1score = rep(0,long)
#recall = rep(0,long)
#precision = rep(0,long)
#for (i in 1:long)
#{
# knn_cv = knn.cv(train_knn_nor[,1:11], k=i, cl=train_knn$nota_vino)
# accuracy[i] = sum(knn_cv == train_knn$nota_vino) /nrow(train_knn_nor)
# recall[i] = sum(knn_cv == train_knn$nota_vino & train_knn$nota_vino == TRUE) / sum(train_knn$nota_vino #== TRUE)
# precision[i] = sum(knn_cv == train_knn$nota_vino & knn_cv == TRUE) / sum(knn_cv == TRUE)
# f1score[i] = 2*precision[i]*recall[i]/(precision[i]+recall[i])
#}
#resultados_knn = as.data.frame(cbind(accuracy,f1score,precision,recall))
#resultados_knn = resultados_knn %>% mutate(index=as.factor(seq(1:long)))
#max(resultados_knn$accuracy)
#which.max(resultados_knn$accuracy)
#ggplot(data=resultados_knn,aes(x=index,y=accuracy)) +
# geom_col(colour="cyan4",fill="cyan3")+
# ggtitle("Accuracy")
In view of the plot we could pick 1 as the optimal number of neighbours on TRAIN.
#prediccion_knn1_train =knn.cv(train_knn_nor[,1:11],
# k=65, cl=train_knn$nota_vino)
#confusionMatrix(table(prediccion_knn1_train,train_knn$nota_vino), positive="1")
With caret:
modelLookup("knn")
## model parameter label forReg forClass probModel
## 1 knn k #Neighbors TRUE TRUE TRUE
set.seed(22222220)
default_knn_mod = train(nota_vino ~ ., data = train_knn, method = "knn",
trControl = trainControl(method = "cv", number = 5), preProcess = c("center",
"scale"), tuneGrid = expand.grid(k = seq(1, 101, by = 2)))
default_knn_mod
## k-Nearest Neighbors
##
## 1279 samples
## 11 predictor
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1023, 1023, 1023, 1023
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.5061972 0.2394264 0.2571936
## 3 0.4487076 0.2555824 0.3121227
## 5 0.4344529 0.2712872 0.3264916
## 7 0.4237836 0.2925480 0.3325424
## 9 0.4242993 0.2888743 0.3411212
## 11 0.4236411 0.2887338 0.3484050
## 13 0.4242014 0.2864561 0.3523926
## 15 0.4242522 0.2857152 0.3565385
## 17 0.4249056 0.2832196 0.3595827
## 19 0.4234204 0.2894033 0.3600900
## 21 0.4230218 0.2911704 0.3619823
## 23 0.4228079 0.2925576 0.3628568
## 25 0.4227244 0.2928066 0.3640757
## 27 0.4224587 0.2948558 0.3653419
## 29 0.4217897 0.2974452 0.3654818
## 31 0.4214151 0.2989128 0.3656444
## 33 0.4217862 0.2977585 0.3665998
## 35 0.4228835 0.2940693 0.3687070
## 37 0.4229730 0.2946472 0.3702003
## 39 0.4223492 0.2971271 0.3703663
## 41 0.4222470 0.2980194 0.3708129
## 43 0.4225593 0.2972405 0.3718740
## 45 0.4219989 0.2998432 0.3719980
## 47 0.4222525 0.2995304 0.3729971
## 49 0.4226305 0.2985528 0.3739368
## 51 0.4233260 0.2962029 0.3752187
## 53 0.4233293 0.2969090 0.3759313
## 55 0.4224786 0.3002975 0.3757497
## 57 0.4230340 0.2986317 0.3766579
## 59 0.4225959 0.3010864 0.3767288
## 61 0.4226846 0.3008182 0.3774033
## 63 0.4225820 0.3016064 0.3777783
## 65 0.4226759 0.3016091 0.3782114
## 67 0.4227211 0.3018326 0.3787567
## 69 0.4229089 0.3012048 0.3793161
## 71 0.4229111 0.3020380 0.3800216
## 73 0.4231405 0.3013250 0.3805254
## 75 0.4225610 0.3045634 0.3805017
## 77 0.4223628 0.3059305 0.3808262
## 79 0.4226074 0.3054217 0.3814828
## 81 0.4226045 0.3056757 0.3819006
## 83 0.4232047 0.3035083 0.3827320
## 85 0.4232515 0.3039249 0.3832400
## 87 0.4232080 0.3046017 0.3835488
## 89 0.4236819 0.3034518 0.3844932
## 91 0.4235793 0.3044937 0.3847450
## 93 0.4231868 0.3067426 0.3848331
## 95 0.4236230 0.3049867 0.3855313
## 97 0.4238544 0.3044939 0.3861137
## 99 0.4239316 0.3046476 0.3864399
## 101 0.4239017 0.3053937 0.3867795
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 31.
summary(default_knn_mod)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 11 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
plot(default_knn_mod)
default_knn_mod$bestTune
## k
## 16 31
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(default_knn_mod)
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 31 0.4214151 0.2989128 0.3656444 0.01298066 0.05423981 0.01234288
# predict(default_knn_mod, newdata = train_knn, type =
# 'prob')
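Note that `nota_vino` was left numeric, so caret fitted KNN as a regression (hence the RMSE/R²/MAE results above). A classification variant would coerce the target to a factor, so that Accuracy and Kappa are reported instead; a sketch assuming the same `train_knn` data defined above:

```r
# Classification variant: a factor target makes caret report Accuracy/Kappa
# (assumes train_knn as built earlier in this section)
set.seed(22222220)
knn_class_mod <- train(as.factor(nota_vino) ~ ., data = train_knn,
    method = "knn",
    trControl = trainControl(method = "cv", number = 5),
    preProcess = c("center", "scale"),
    tuneGrid = expand.grid(k = seq(1, 101, by = 2)))
```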
In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with a passing grade (quality >= 6) or a failing grade (quality < 6).
train_tree <- train[, colnames(train) != "quality"]
train_tree$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso"))
train_tree
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
str(train_tree)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
# classification tree (default split = "gini"), lowering the complexity parameter to cp = 0.006:
tree = rpart(nota_vino ~ ., data = train_tree, cp=0.006)
rpart.plot(tree, nn = TRUE, extra = 104, box.palette = "GnBu", branch.lty = 3, shadow.col = "gray")
tree
## n= 1279
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1279 597 aprobado (0.5332291 0.4667709)
## 2) alcohol>=10.525 500 98 aprobado (0.8040000 0.1960000)
## 4) volatile_acidity< 0.87 476 80 aprobado (0.8319328 0.1680672) *
## 5) volatile_acidity>=0.87 24 6 suspenso (0.2500000 0.7500000) *
## 3) alcohol< 10.525 779 280 suspenso (0.3594352 0.6405648)
## 6) Log_sulphates>=-0.4700356 285 134 aprobado (0.5298246 0.4701754)
## 12) volatile_acidity< 0.545 182 64 aprobado (0.6483516 0.3516484)
## 24) Log_total_sulfur_dioxide< 4.166635 140 39 aprobado (0.7214286 0.2785714) *
## 25) Log_total_sulfur_dioxide>=4.166635 42 17 suspenso (0.4047619 0.5952381) *
## 13) volatile_acidity>=0.545 103 33 suspenso (0.3203883 0.6796117)
## 26) Log_chlorides< -2.333097 67 29 suspenso (0.4328358 0.5671642)
## 52) fixed_acidity>=9.55 8 0 aprobado (1.0000000 0.0000000) *
## 53) fixed_acidity< 9.55 59 21 suspenso (0.3559322 0.6440678) *
## 27) Log_chlorides>=-2.333097 36 4 suspenso (0.1111111 0.8888889) *
## 7) Log_sulphates< -0.4700356 494 129 suspenso (0.2611336 0.7388664) *
We analyse the results numerically:
rpart.rules(tree, style = "tall")
## nota_vino is 0.00 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity >= 9.6
##
## nota_vino is 0.17 when
## alcohol >= 11
## volatile_acidity < 0.87
##
## nota_vino is 0.28 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide < 4.2
##
## nota_vino is 0.60 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide >= 4.2
##
## nota_vino is 0.64 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity < 9.6
##
## nota_vino is 0.74 when
## alcohol < 11
## Log_sulphates < -0.47
##
## nota_vino is 0.75 when
## alcohol >= 11
## volatile_acidity >= 0.87
##
## nota_vino is 0.89 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides >= -2.3
We assess a possible pruning of the model, which would simplify it and make it more interpretable without losing predictive power. For this we look at the CP, or “complexity parameter”, seeking the shallowest tree that also minimises the error rate.
plotcp(tree)  # CP - complexity parameter: we look for the shallowest tree that also minimises the error rate
printcp(tree)
##
## Classification tree:
## rpart(formula = nota_vino ~ ., data = train_tree, cp = 0.006)
##
## Variables actually used in tree construction:
## [1] alcohol fixed_acidity Log_chlorides
## [4] Log_sulphates Log_total_sulfur_dioxide volatile_acidity
##
## Root node error: 597/1279 = 0.46677
##
## n= 1279
##
## CP nsplit rel error xerror xstd
## 1 0.3668342 0 1.00000 1.00000 0.029886
## 2 0.0452261 1 0.63317 0.64824 0.027519
## 3 0.0201005 3 0.54271 0.59296 0.026802
## 4 0.0134003 4 0.52261 0.57119 0.026489
## 5 0.0067002 5 0.50921 0.55611 0.026262
## 6 0.0060000 7 0.49581 0.55444 0.026237
Finally, we decide to prune and create a simpler alternative model:
xerror <- tree$cptable[, "xerror"]
imin.xerror <- which.min(xerror)
upper.xerror <- xerror[imin.xerror] + tree$cptable[imin.xerror,
"xstd"]
icp <- min(which(xerror <= upper.xerror))
cp <- tree$cptable[icp, "CP"]
cp
## [1] 0.01340034
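The CP value above comes from the one-standard-error rule, and can be checked directly against the `printcp(tree)` table printed earlier:

```r
# 1-SE rule, worked with the numbers from the cptable above
xerror_min <- 0.55444                  # smallest xerror (row 6)
upper <- xerror_min + 0.026237         # plus its xstd -> ~0.58068
# The first row with xerror <= upper is row 4 (0.57119 <= 0.58068),
# whose CP of 0.0134003 is the value selected above for pruning.
upper
```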
tree_2 <- prune(tree, cp = cp)
# tree summary(tree) caret::varImp(tree) importance <-
# tree$variable.importance importance <-
# round(100*importance/sum(importance), 1)
# importance[importance >= 1]
rpart.plot(tree_2, nn = TRUE, extra = 104, box.palette = "GnBu",
branch.lty = 3, shadow.col = "gray") #, main='Classification tree winetaste'
We now validate the predictive power of our decision-tree model on the test dataset. To do so, first of all, we again create the binary variable “nota_vino” on our test data.
test_tree <- test[, colnames(test) != "quality"]
test_tree$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(test_tree)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
obs <- test_tree$nota_vino
head(predict(tree, newdata = test_tree))
## aprobado suspenso
## 1 0.2611336 0.7388664
## 2 0.2611336 0.7388664
## 3 0.2611336 0.7388664
## 4 0.2611336 0.7388664
## 5 0.2611336 0.7388664
## 6 0.2611336 0.7388664
pred <- predict(tree, newdata = test_tree, type = "class")
table(obs, pred)
## pred
## obs aprobado suspenso
## aprobado 112 61
## suspenso 38 109
caret::confusionMatrix(pred, obs)
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 112 38
## suspenso 61 109
##
## Accuracy : 0.6906
## 95% CI : (0.6368, 0.7409)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 3.112e-08
##
## Kappa : 0.3844
##
## Mcnemar's Test P-Value : 0.02703
##
## Sensitivity : 0.6474
## Specificity : 0.7415
## Pos Pred Value : 0.7467
## Neg Pred Value : 0.6412
## Prevalence : 0.5406
## Detection Rate : 0.3500
## Detection Prevalence : 0.4688
## Balanced Accuracy : 0.6944
##
## 'Positive' Class : aprobado
##
We obtain an accuracy of 69.06% for the model, with the drawback of having an unpruned, overly complex model that may tend to overfit.
obs2 <- test_tree$nota_vino
head(predict(tree_2, newdata = test_tree))
## aprobado suspenso
## 1 0.2611336 0.7388664
## 2 0.2611336 0.7388664
## 3 0.2611336 0.7388664
## 4 0.2611336 0.7388664
## 5 0.2611336 0.7388664
## 6 0.2611336 0.7388664
pred2 <- predict(tree_2, newdata = test_tree, type = "class")
table(obs2, pred2)
## pred2
## obs2 aprobado suspenso
## aprobado 110 63
## suspenso 42 105
caret::confusionMatrix(pred2, obs2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 110 42
## suspenso 63 105
##
## Accuracy : 0.6719
## 95% CI : (0.6175, 0.7231)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 1.214e-06
##
## Kappa : 0.3464
##
## Mcnemar's Test P-Value : 0.05096
##
## Sensitivity : 0.6358
## Specificity : 0.7143
## Pos Pred Value : 0.7237
## Neg Pred Value : 0.6250
## Prevalence : 0.5406
## Detection Rate : 0.3438
## Detection Prevalence : 0.4750
## Balanced Accuracy : 0.6751
##
## 'Positive' Class : aprobado
##
Applying the pruning to our tree, we obtain a cleaner, simpler, more interpretable model that generalises better to other data, avoiding possible overfitting while only reducing its predictive accuracy to 67.19%. We consider this pruned model to be the optimal one in this case.
On our training dataset we create the binary variable "nota_vino", which, based on "quality", marks each wine as passed (quality >= 6) or failed (quality < 6).
train_forest <- train[, colnames(train) != "quality"]
train_forest$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso"))
train_forest
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
str(train_forest)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
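As a quick check of the encoding: factor() on a logical vector orders the levels FALSE < TRUE, so the first label ("aprobado") is assigned to FALSE, i.e. quality >= 6, and "suspenso" to TRUE. A toy example (not the wine data):

```r
# FALSE sorts before TRUE, so labels map as: FALSE -> "aprobado", TRUE -> "suspenso"
quality_demo <- c(7, 5, 6)
nota_demo <- factor(quality_demo < 6, labels = c("aprobado", "suspenso"))
as.character(nota_demo)  # "aprobado" "suspenso" "aprobado"
```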
We fit the random forest model:
set.seed(4343)
rf <- randomForest(train_forest$nota_vino ~ ., data = train_forest,
ntree = 200, do.trace = T, importance = T)
## ntree OOB 1 2
## 1: 29.89% 26.40% 33.78%
## 2: 31.23% 26.48% 36.19%
## 3: 30.63% 26.91% 34.62%
## 4: 29.61% 24.61% 35.21%
## 5: 27.61% 23.24% 32.48%
## ... (iterations 6-197 omitted; the OOB error falls from ~28% and stabilizes near 20%) ...
## 198: 19.94% 19.79% 20.10%
## 199: 20.17% 20.23% 20.10%
## 200: 19.94% 19.79% 20.10%
rf
##
## Call:
## randomForest(formula = train_forest$nota_vino ~ ., data = train_forest, ntree = 200, do.trace = T, importance = T)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.94%
## Confusion matrix:
## aprobado suspenso class.error
## aprobado 547 135 0.1979472
## suspenso 120 477 0.2010050
We examine how the error converges as trees are added:
plot(rf,main="")
legend("right", colnames(rf$err.rate), lty = 1:5, col = 1:6)
We inspect the importance of the variables in the model (the key variable with the largest effect on the model's accuracy is "alcohol"):
varImpPlot(rf)
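Besides the plot, the permutation importances can be extracted as a sorted table with `importance()`. A sketch on the built-in iris data (the report's `rf` object is not reproduced here):

```r
library(randomForest)

# Fit a small forest with permutation importance enabled
set.seed(1)
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

# type = 1 returns the mean decrease in accuracy; sort descending
imp <- importance(rf_demo, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```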
We now validate the predictive power of our random forest model on the test dataset. To do so, we first recreate the binary variable "nota_vino" on the test data.
test_forest <- test[, colnames(test) != "quality"]
test_forest$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(test_forest)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
We run the prediction on the test data:
pred2222<-predict(rf, newdata=test_forest, type="class")
confusionMatrix(pred2222, test_forest$nota_vino)
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 130 33
## suspenso 43 114
##
## Accuracy : 0.7625
## 95% CI : (0.712, 0.8081)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5243
##
## Mcnemar's Test P-Value : 0.3019
##
## Sensitivity : 0.7514
## Specificity : 0.7755
## Pos Pred Value : 0.7975
## Neg Pred Value : 0.7261
## Prevalence : 0.5406
## Detection Rate : 0.4062
## Detection Prevalence : 0.5094
## Balanced Accuracy : 0.7635
##
## 'Positive' Class : aprobado
##
We reach an accuracy of 76.25% on test.
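Since pROC is loaded, accuracy could be complemented with an ROC curve and AUC computed from class probabilities. A sketch with simulated scores (the report's rf predictions are not reproduced here):

```r
library(pROC)

# Simulated probability-of-"aprobado" scores for 100 wines of each true class
set.seed(42)
y <- factor(rep(c("aprobado", "suspenso"), each = 100))
score <- c(rnorm(100, mean = 0.6, sd = 0.2),   # true "aprobado" wines
           rnorm(100, mean = 0.4, sd = 0.2))   # true "suspenso" wines

# levels = c(control, case); direction = "<" means controls score lower
roc_obj <- roc(response = y, predictor = score,
               levels = c("suspenso", "aprobado"), direction = "<")
auc(roc_obj)
plot(roc_obj)
```

With the real model, `predict(rf, newdata = test_forest, type = "prob")[, "aprobado"]` would supply the scores.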
Reference: https://rubenfcasal.github.io/aprendizaje_estadistico/boosting-en-r.html
On our training dataset we create the binary variable "nota_vino", which, based on "quality", marks each wine as passed (quality >= 6) or failed (quality < 6).
train_en <- train[, colnames(train) != "quality"]
train_en$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_en)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
We fit the boosting model with an initial parameter configuration:
ada.boost <- ada(nota_vino ~ ., data = train_en, type = "real",
control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10, xval = 0),
iter = 150, nu = 0.05)
ada.boost
## Call:
## ada(nota_vino ~ ., data = train_en, type = "real", control = rpart.control(maxdepth = 2,
## cp = 0, minsplit = 10, xval = 0), iter = 150, nu = 0.05)
##
## Loss: exponential Method: real Iteration: 150
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value aprobado suspenso
## aprobado 531 151
## suspenso 116 481
##
## Train Error: 0.209
##
## Out-Of-Bag Error: 0.221 iteration= 142
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1
## 149 149
The plot shows the error decreasing as the number of iterations of the model grows:
plot(ada.boost)
We evaluate the model's accuracy on the test sample:
test_en <- test[, colnames(test) != "quality"]
test_en$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(test_en)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
set.seed(123)
pred_ada <- predict(ada.boost, newdata = test_en)
caret::confusionMatrix(pred_ada, test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 117 39
## suspenso 56 108
##
## Accuracy : 0.7031
## 95% CI : (0.6498, 0.7527)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 2.039e-09
##
## Kappa : 0.4075
##
## Mcnemar's Test P-Value : 0.1007
##
## Sensitivity : 0.6763
## Specificity : 0.7347
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.6585
## Prevalence : 0.5406
## Detection Rate : 0.3656
## Detection Prevalence : 0.4875
## Balanced Accuracy : 0.7055
##
## 'Positive' Class : aprobado
##
With this parameter configuration, the ada boosting model achieves a classification accuracy of 70.31% on test.
To optimize the model's results, we can tune its hyperparameters:
modelLookup("ada")
## model parameter label forReg forClass probModel
## 1 ada iter #Trees FALSE TRUE TRUE
## 2 ada maxdepth Max Tree Depth FALSE TRUE TRUE
## 3 ada nu Learning Rate FALSE TRUE TRUE
These are the "iter", "maxdepth" and "nu" tuning parameters of the ada boosting model for decision trees in classification problems.
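Instead of caret's default grid, an explicit search grid over the three parameters could be supplied via tuneGrid. A sketch (the `train()` call is left commented because it requires the report's `train_en` data and is slow to run):

```r
# Full factorial grid over the three ada tuning parameters: 3 x 3 x 2 = 18 candidates
grid_ada <- expand.grid(iter = c(50, 100, 150),
                        maxdepth = 1:3,
                        nu = c(0.05, 0.1))
nrow(grid_ada)

# caret.ada_grid <- caret::train(nota_vino ~ ., method = "ada", data = train_en,
#                                tuneGrid = grid_ada,
#                                trControl = caret::trainControl(method = "cv", number = 5))
```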
set.seed(123)
caret.ada0 <- train(nota_vino ~ ., method = "ada", data = train_en,
trControl = trainControl(method = "cv", number = 5))
caret.ada0
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7397072 0.4808586
## 1 100 0.7600076 0.5185542
## 1 150 0.7623483 0.5226994
## 2 50 0.7514108 0.5003883
## 2 100 0.7654825 0.5288044
## 2 150 0.7592111 0.5162555
## 3 50 0.7584145 0.5153590
## 3 100 0.7583932 0.5153576
## 3 150 0.7544839 0.5072981
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 2 and nu = 0.1.
The optimal hyperparameter configuration for the model is "iter" = 100, "maxdepth" = 2 and "nu" = 0.1.
confusionMatrix(predict(caret.ada0, newdata = test_en), test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 120 42
## suspenso 53 105
##
## Accuracy : 0.7031
## 95% CI : (0.6498, 0.7527)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 2.039e-09
##
## Kappa : 0.4056
##
## Mcnemar's Test P-Value : 0.3049
##
## Sensitivity : 0.6936
## Specificity : 0.7143
## Pos Pred Value : 0.7407
## Neg Pred Value : 0.6646
## Prevalence : 0.5406
## Detection Rate : 0.3750
## Detection Prevalence : 0.5062
## Balanced Accuracy : 0.7040
##
## 'Positive' Class : aprobado
##
With the base model we obtain an accuracy of 70.31% on the test data.
We now refit the model with the optimal hyperparameter configuration:
set.seed(123)
caret.ada1 <- train(nota_vino ~ ., method = "ada", data = train_en,
tuneGrid = data.frame(iter = 100, maxdepth = 2, nu = c(0.1)),
trControl = trainControl(method = "cv", number = 5))
caret.ada1
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022
## Resampling results:
##
## Accuracy Kappa
## 0.7631112 0.5249703
##
## Tuning parameter 'iter' was held constant at a value of 100
## Tuning
## parameter 'maxdepth' was held constant at a value of 2
## Tuning parameter
## 'nu' was held constant at a value of 0.1
We reach a cross-validated accuracy of 76.31% on train.
confusionMatrix(predict(caret.ada1, newdata = test_en), test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 123 40
## suspenso 50 107
##
## Accuracy : 0.7188
## 95% CI : (0.6661, 0.7673)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 4.863e-11
##
## Kappa : 0.4366
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.7110
## Specificity : 0.7279
## Pos Pred Value : 0.7546
## Neg Pred Value : 0.6815
## Prevalence : 0.5406
## Detection Rate : 0.3844
## Detection Prevalence : 0.5094
## Balanced Accuracy : 0.7194
##
## 'Positive' Class : aprobado
##
We reach an accuracy of 71.88% on test.
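To keep track of the classifiers compared so far, the test accuracies reported in the confusion matrices of this section can be collected into a single table (values transcribed from the output above):

```r
library(tibble)

# Test-set accuracies taken from the confusion matrices reported in this section
resultados <- tibble(
  modelo = c("unpruned tree", "pruned tree", "random forest",
             "ada (base)", "ada (tuned)"),
  accuracy_test = c(0.6906, 0.6719, 0.7625, 0.7031, 0.7188)
)
resultados[order(resultados$accuracy_test, decreasing = TRUE), ]
```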
On our training dataset we create the binary variable "nota_vino", which, based on "quality", marks each wine as passed (quality >= 6) or failed (quality < 6).
train_xgb <- train[, colnames(train) != "quality"]
train_xgb$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_xgb)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
To optimize the results of the model, we can tune its hyperparameters:
modelLookup("xgbTree")
## model parameter label forReg forClass
## 1 xgbTree nrounds # Boosting Iterations TRUE TRUE
## 2 xgbTree max_depth Max Tree Depth TRUE TRUE
## 3 xgbTree eta Shrinkage TRUE TRUE
## 4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
## 5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
## 7 xgbTree subsample Subsample Percentage TRUE TRUE
## probModel
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
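A custom grid for xgbTree must provide all seven parameters listed above, even those held fixed. A small sketch varying only the tree-shape parameters (the `train()` call itself is omitted because it needs the report's `train_xgb` data):

```r
# 2 x 3 x 2 = 12 candidates; the remaining four parameters are held constant
grid_xgb <- expand.grid(nrounds = c(100, 200),
                        max_depth = c(2, 4, 6),
                        eta = c(0.1, 0.3),
                        gamma = 0,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1)
nrow(grid_xgb)
```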
We fit the boosting model, letting caret search a default initial grid of parameters:
set.seed(2)
caret.xgb <- train(nota_vino ~ ., method = "xgbTree", data = train_xgb,
trControl = trainControl(method = "cv", number = 5))
## [18:39:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:39:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
caret.xgb
## eXtreme Gradient Boosting
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1023, 1024, 1023, 1023
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy Kappa
## 0.3 1 0.6 0.50 50 0.7584252 0.5157983
## 0.3 1 0.6 0.50 100 0.7615594 0.5214823
## 0.3 1 0.6 0.50 150 0.7592126 0.5168812
## 0.3 1 0.6 0.75 50 0.7685784 0.5357000
## 0.3 1 0.6 0.75 100 0.7631158 0.5251201
## 0.3 1 0.6 0.75 150 0.7662439 0.5317973
## 0.3 1 0.6 1.00 50 0.7709252 0.5405230
## 0.3 1 0.6 1.00 100 0.7631189 0.5252044
## 0.3 1 0.6 1.00 150 0.7623346 0.5234812
## 0.3 1 0.8 0.50 50 0.7529473 0.5043059
## 0.3 1 0.8 0.50 100 0.7599847 0.5184122
## 0.3 1 0.8 0.50 150 0.7584038 0.5151712
## 0.3 1 0.8 0.75 50 0.7670221 0.5322787
## 0.3 1 0.8 0.75 100 0.7709314 0.5404779
## 0.3 1 0.8 0.75 150 0.7685876 0.5355394
## 0.3 1 0.8 1.00 50 0.7670129 0.5328119
## 0.3 1 0.8 1.00 100 0.7623346 0.5236110
## 0.3 1 0.8 1.00 150 0.7623346 0.5238734
## 0.3 2 0.6 0.50 50 0.7693566 0.5379810
## 0.3 2 0.6 0.50 100 0.7810784 0.5609420
## 0.3 2 0.6 0.50 150 0.7756189 0.5492785
## 0.3 2 0.6 0.75 50 0.7771814 0.5531543
## 0.3 2 0.6 0.75 100 0.7787561 0.5555964
## 0.3 2 0.6 0.75 150 0.7779687 0.5546161
## 0.3 2 0.6 1.00 50 0.7709252 0.5415508
## 0.3 2 0.6 1.00 100 0.7834375 0.5660762
## 0.3 2 0.6 1.00 150 0.7818781 0.5625740
## 0.3 2 0.8 0.50 50 0.7662347 0.5309808
## 0.3 2 0.8 0.50 100 0.7795313 0.5579521
## 0.3 2 0.8 0.50 150 0.7740472 0.5468629
## 0.3 2 0.8 0.75 50 0.7615319 0.5223741
## 0.3 2 0.8 0.75 100 0.7717218 0.5423465
## 0.3 2 0.8 0.75 150 0.7693750 0.5372122
## 0.3 2 0.8 1.00 50 0.7740472 0.5483062
## 0.3 2 0.8 1.00 100 0.7842279 0.5672228
## 0.3 2 0.8 1.00 150 0.7709314 0.5403959
## 0.3 3 0.6 0.50 50 0.7779779 0.5544845
## 0.3 3 0.6 0.50 100 0.7662500 0.5300660
## 0.3 3 0.6 0.50 150 0.7725000 0.5422163
## 0.3 3 0.6 0.75 50 0.7779687 0.5537440
## 0.3 3 0.6 0.75 100 0.7857874 0.5685593
## 0.3 3 0.6 0.75 150 0.7826532 0.5626896
## 0.3 3 0.6 1.00 50 0.7701409 0.5392758
## 0.3 3 0.6 1.00 100 0.7818842 0.5616341
## 0.3 3 0.6 1.00 150 0.7873529 0.5723541
## 0.3 3 0.8 0.50 50 0.7623529 0.5230456
## 0.3 3 0.8 0.50 100 0.7631189 0.5236251
## 0.3 3 0.8 0.50 150 0.7779657 0.5535856
## 0.3 3 0.8 0.75 50 0.7716881 0.5413281
## 0.3 3 0.8 0.75 100 0.7818689 0.5614994
## 0.3 3 0.8 0.75 150 0.7818627 0.5613169
## 0.3 3 0.8 1.00 50 0.7810999 0.5609327
## 0.3 3 0.8 1.00 100 0.7912561 0.5808283
## 0.3 3 0.8 1.00 150 0.7912561 0.5802007
## 0.4 1 0.6 0.50 50 0.7545129 0.5074292
## 0.4 1 0.6 0.50 100 0.7607690 0.5204900
## 0.4 1 0.6 0.50 150 0.7560662 0.5113351
## 0.4 1 0.6 0.75 50 0.7623162 0.5232939
## 0.4 1 0.6 0.75 100 0.7607567 0.5197020
## 0.4 1 0.6 0.75 150 0.7678002 0.5336260
## 0.4 1 0.6 1.00 50 0.7599724 0.5186811
## 0.4 1 0.6 1.00 100 0.7576409 0.5144002
## 0.4 1 0.6 1.00 150 0.7584283 0.5156999
## 0.4 1 0.8 0.50 50 0.7545129 0.5080772
## 0.4 1 0.8 0.50 100 0.7568597 0.5132168
## 0.4 1 0.8 0.50 150 0.7678033 0.5339583
## 0.4 1 0.8 0.75 50 0.7646752 0.5283858
## 0.4 1 0.8 0.75 100 0.7631189 0.5249673
## 0.4 1 0.8 0.75 150 0.7678186 0.5340579
## 0.4 1 0.8 1.00 50 0.7568474 0.5124567
## 0.4 1 0.8 1.00 100 0.7631097 0.5251788
## 0.4 1 0.8 1.00 150 0.7709252 0.5411619
## 0.4 2 0.6 0.50 50 0.7591912 0.5170492
## 0.4 2 0.6 0.50 100 0.7592034 0.5159729
## 0.4 2 0.6 0.50 150 0.7607659 0.5191407
## 0.4 2 0.6 0.75 50 0.7662439 0.5317023
## 0.4 2 0.6 0.75 100 0.7685907 0.5357065
## 0.4 2 0.6 0.75 150 0.7670313 0.5316144
## 0.4 2 0.6 1.00 50 0.7717065 0.5422725
## 0.4 2 0.6 1.00 100 0.7818689 0.5627269
## 0.4 2 0.6 1.00 150 0.7803002 0.5587702
## 0.4 2 0.8 0.50 50 0.7670221 0.5322487
## 0.4 2 0.8 0.50 100 0.7756281 0.5490931
## 0.4 2 0.8 0.50 150 0.7693811 0.5366440
## 0.4 2 0.8 0.75 50 0.7615380 0.5209873
## 0.4 2 0.8 0.75 100 0.7725031 0.5429529
## 0.4 2 0.8 0.75 150 0.7834467 0.5652049
## 0.4 2 0.8 1.00 50 0.7771875 0.5531451
## 0.4 2 0.8 1.00 100 0.7724969 0.5436643
## 0.4 2 0.8 1.00 150 0.7771906 0.5527097
## 0.4 3 0.6 0.50 50 0.7599939 0.5173947
## 0.4 3 0.6 0.50 100 0.7639185 0.5247570
## 0.4 3 0.6 0.50 150 0.7654534 0.5279526
## 0.4 3 0.6 0.75 50 0.7600092 0.5180450
## 0.4 3 0.6 0.75 100 0.7795435 0.5569284
## 0.4 3 0.6 0.75 150 0.7779718 0.5536798
## 0.4 3 0.6 1.00 50 0.7701562 0.5381242
## 0.4 3 0.6 1.00 100 0.7810938 0.5604603
## 0.4 3 0.6 1.00 150 0.7795404 0.5570991
## 0.4 3 0.8 0.50 50 0.7670190 0.5321376
## 0.4 3 0.8 0.50 100 0.7725061 0.5421105
## 0.4 3 0.8 0.50 150 0.7756281 0.5483835
## 0.4 3 0.8 0.75 50 0.7701562 0.5376402
## 0.4 3 0.8 0.75 100 0.7803033 0.5580894
## 0.4 3 0.8 0.75 150 0.7865686 0.5709177
## 0.4 3 0.8 1.00 50 0.7810999 0.5600757
## 0.4 3 0.8 1.00 100 0.7850184 0.5680617
## 0.4 3 0.8 1.00 150 0.7818811 0.5616134
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3, eta
## = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 1.
Para optimizar los resultados del modelo, consultamos la combinación óptima de hiperparámetros obtenida durante el entrenamiento:
caret.xgb$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 53 100 3 0.3 0 0.8 1 1
Obtenemos la configuración óptima de hiperparámetros del modelo en nrounds = 100, max_depth = 3, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 y subsample = 1.
Vemos la relevancia de cada variable en el modelo:
varImp(caret.xgb)
## xgbTree variable importance
##
## Overall
## alcohol 100.000
## Log_sulphates 47.822
## volatile_acidity 44.375
## Log_total_sulfur_dioxide 31.256
## density 29.242
## Log_chlorides 23.118
## pH 9.363
## citric_acid 6.351
## fixed_acidity 3.858
## Log_free_sulfur_dioxide 1.028
## Log_residual_sugar 0.000
Probamos el modelo base con los datos de test:
test_xgb <- test[, colnames(test) != "quality"]
test_xgb$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(test_xgb)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
confusionMatrix(predict(caret.xgb, newdata = test_xgb), test_xgb$nota_vino)
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 117 35
## suspenso 56 112
##
## Accuracy : 0.7156
## 95% CI : (0.6628, 0.7644)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 1.058e-10
##
## Kappa : 0.4336
##
## Mcnemar's Test P-Value : 0.03603
##
## Sensitivity : 0.6763
## Specificity : 0.7619
## Pos Pred Value : 0.7697
## Neg Pred Value : 0.6667
## Prevalence : 0.5406
## Detection Rate : 0.3656
## Detection Prevalence : 0.4750
## Balanced Accuracy : 0.7191
##
## 'Positive' Class : aprobado
##
Se obtiene un accuracy en test del 71.56%.
Reentrenamos el modelo aplicando la configuración de hiperparámetros óptima obtenida anteriormente:
set.seed(1)
caret.xgb1 <- train(nota_vino ~ ., method = "xgbTree", data = train_xgb,
tuneGrid = data.frame(nrounds = 100, max_depth = 3, eta = 0.3,
gamma = 0, colsample_bytree = 0.8, min_child_weight = 1,
subsample = 1), trControl = trainControl(method = "cv",
number = 5))
caret.xgb1
## eXtreme Gradient Boosting
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022
## Resampling results:
##
## Accuracy Kappa
## 0.7763987 0.5515202
##
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
## held constant at a value of 1
## Tuning parameter 'subsample' was held
## constant at a value of 1
Obtenemos un accuracy del 77.64% en train (validación cruzada) con la configuración de parámetros óptima.
confusionMatrix(predict(caret.xgb1, newdata = test_xgb), test_xgb$nota_vino)
## Confusion Matrix and Statistics
##
## Reference
## Prediction aprobado suspenso
## aprobado 124 36
## suspenso 49 111
##
## Accuracy : 0.7344
## 95% CI : (0.6824, 0.782)
## No Information Rate : 0.5406
## P-Value [Acc > NIR] : 7.942e-13
##
## Kappa : 0.4688
##
## Mcnemar's Test P-Value : 0.1931
##
## Sensitivity : 0.7168
## Specificity : 0.7551
## Pos Pred Value : 0.7750
## Neg Pred Value : 0.6937
## Prevalence : 0.5406
## Detection Rate : 0.3875
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.7359
##
## 'Positive' Class : aprobado
##
Obtenemos un accuracy del 73.44% en test con la configuración de parámetros óptima, mejorando el 71.56% del modelo base.
Utilizamos el paquete kernlab para crear nuestro algoritmo SVM y entrenamos el modelo con la función train() del paquete caret.
En nuestros datasets de train y test hemos creado la variable binaria “nota_vino” para que, en función de “quality”, nos indique los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
train_svm <- train[, colnames(train)!="quality"]
train_svm$nota_vino <- factor(train$quality < 6, labels = c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
train_svm
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
Creamos un modelo SVM Lineal con todos los predictores de nuestro data set.
set.seed(13)
modelo_svmlineal <- train(nota_vino ~ ., method = "svmLinear", data = train_svm)
modelo_svmlineal$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 731
##
## Objective Function Value : -722.4628
## Training error : 0.244722
Con el objeto finalModel podemos observar cuáles son los hiperparámetros utilizados (cost C) y el error de entrenamiento. Este error corresponde al que comete el modelo al intentar predecir las mismas observaciones con las que se ha entrenado. En nuestro caso tenemos un 24,47% de error.
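El error de entrenamiento que imprime ksvm puede reproducirse manualmente prediciendo sobre los propios datos de train. Un esbozo mínimo y autocontenido con datos simulados (los nombres x, y y fit son hipotéticos, solo ilustrativos):

```r
library(kernlab)

set.seed(13)
# Datos simulados: dos clases separables con algo de ruido
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(200, sd = 0.5) > 0,
                   "aprobado", "suspenso"))

# SVM lineal análogo al de caret (kernel "vanilladot", C = 1)
fit <- ksvm(x, y, kernel = "vanilladot", C = 1)

# El "Training error" es la proporción de fallos sobre el propio train:
mean(predict(fit, x) != y)
error(fit)  # mismo valor, vía el accesor de kernlab
```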
Ajustamos y evaluamos nuestro modelo múltiples veces con distintos subconjuntos creados a partir de los datos de entrenamiento mediante validación cruzada (cross validation), obteniendo para cada repetición una estimación del error. Con estos métodos, el coste computacional de ajustar múltiples veces un modelo es alto; por eso, con caret podemos paralelizar el proceso para acelerarlo.
# paralelización
registerDoMC(cores = 4)
# número de particiones y repeticiones para la validación cruzada
particiones <- 10
repeticiones <- 5
# modelo
control_modelosvm_lineal <- trainControl(method = "repeatedcv", number = particiones,
    repeats = repeticiones, returnResamp = "all", verboseIter = FALSE,
    allowParallel = TRUE)
set.seed(342)
modelo_svmlineal <- train(nota_vino ~ ., data = train_svm,
method = "svmLinear",
metric = "Accuracy",
trControl = control_modelosvm_lineal)
modelo_svmlineal
## Support Vector Machines with Linear Kernel
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1151, 1151, 1150, 1151, 1152, 1151, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7513235 0.5032292
##
## Tuning parameter 'C' was held constant at a value of 1
Con esta validación cruzada con 10 particiones y 5 repeticiones hemos ajustado nuestro modelo 50 veces. Podemos pintar una gráfica con el accuracy obtenido en cada uno de estos modelos.
grafo1 <- ggplot(data = modelo_svmlineal$resample, aes(x = Accuracy)) +
    geom_density(alpha = 0.5, fill = "blue") +
    geom_vline(xintercept = mean(modelo_svmlineal$resample$Accuracy),
        linetype = "dashed") + theme_bw()
grafo2 <- ggplot(data = modelo_svmlineal$resample, aes(x = 1, y = Accuracy)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.5, fill = "blue") +
    geom_jitter(width = 0.05) + labs(x = "") + theme_bw() +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
final_plot_svm <- ggarrange(grafo1, grafo2)
final_plot_svm <- annotate_figure(final_plot_svm,top = text_grob("Accuracy obtenido en la validación", size = 15))
final_plot_svm
Con validación cruzada llegamos a un accuracy promedio de 0.7513; es decir, nuestro modelo SVM lineal clasifica correctamente un vino como bueno o malo el 75% de las veces.
Como hemos visto, nuestro modelo svmLinear tiene un hiperparámetro llamado coste (C). Volvemos a ajustar el modelo con diferentes valores de C aplicando de nuevo validación cruzada, para identificar con cuál de los submodelos se obtiene el mejor resultado.
Para los diferentes valores de C hemos elegido trabajar con grid search, donde se especifican explícitamente los valores de los hiperparámetros a probar.
# paralelización
registerDoMC(cores = 4)
# hiperparámetros y número de particiones/repeticiones
particiones <- 10
repeticiones <- 5
hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))
# modelo
control_modelosvm_lineal <- trainControl(method = "repeatedcv", number = particiones,
    repeats = repeticiones, returnResamp = "all", verboseIter = FALSE,
    allowParallel = TRUE)
set.seed(342)
modelo_svmlineal <- train(nota_vino ~ ., data = train_svm, method = "svmLinear",
    tuneGrid = hiperparametros, metric = "Accuracy",
    trControl = control_modelosvm_lineal)
modelo_svmlineal
## Support Vector Machines with Linear Kernel
##
## 1279 samples
## 11 predictor
## 2 classes: 'aprobado', 'suspenso'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1151, 1151, 1150, 1151, 1152, 1151, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 1e-03 0.7305088 0.4638641
## 1e-02 0.7525735 0.5062830
## 1e-01 0.7499246 0.5006288
## 5e-01 0.7506985 0.5019843
## 1e+00 0.7513235 0.5032292
## 1e+01 0.7516373 0.5038068
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
Podemos observar que, de todos los valores de C probados, C = 0.01 es el que devuelve los mejores resultados, con un accuracy de 0.7526.
Observamos la variación del accuracy para cada uno de los valores de C en el gráfico inferior.
ggplot(data = modelo_svmlineal$resample,
       aes(x = as.factor(C), y = Accuracy, color = as.factor(C))) +
    geom_boxplot(outlier.shape = NA, alpha = 0.6) +
    geom_jitter(width = 0.2, alpha = 0.6) +
    geom_hline(yintercept = 0.62, linetype = "dashed") +
    labs(x = "C") + theme_bw() + theme(legend.position = "none")
También podemos observar la evolución de los modelos según los valores de hiperparámetros que hemos elegido.
ggplot(modelo_svmlineal, highlight = TRUE) +labs(title = "Evolución accuracy en función de C") + theme_bw()
Pasamos ahora a un enfoque no supervisado: clustering con K-means. Quitamos la variable respuesta “quality”:
train_kmeans <- train[, -6]
train_kmeans
## # A tibble: 1,279 x 11
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
Buscamos el número óptimo de clusters para nuestro modelo con el método del codo (within-cluster sum of squares):
fviz_nbclust(train_kmeans, kmeans, method = "wss")
Otra forma de buscar el óptimo es el estadístico gap:
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(train_kmeans,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
Desarrollamos el clustering con K-means usando el número de clusters elegido (K = 3):
#make this example reproducible
set.seed(666)
km <- kmeans(train_kmeans, centers = 3, nstart = 25)
#view results
km
## K-means clustering with 3 clusters of sizes 615, 352, 312
##
## Cluster means:
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## 1 7.798862 0.5699187 0.2045041 0.9969913 9.696016 3.316846
## 2 7.110227 0.5246023 0.2134659 0.9948793 11.558286 3.398239
## 3 10.864103 0.4420192 0.4761218 0.9984547 10.610791 3.189583
## Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1 0.8186424 -2.434321 2.701835
## 2 0.8225008 -2.642422 2.513108
## 3 0.9650535 -2.460537 2.270269
## Log_total_sulfur_dioxide Log_sulphates
## 1 3.861476 -0.4820557
## 2 3.374945 -0.4497846
## 3 3.353542 -0.3855143
##
## Clustering vector:
## [1] 2 1 2 1 1 1 1 2 2 2 2 3 1 2 3 1 3 2 2 2 3 2 1 1 2 3 1 3 2 3 1 1 2 2 1 3 1
## [38] 1 3 3 2 1 2 3 2 2 3 1 1 1 1 1 3 2 2 3 2 1 3 1 1 2 1 3 3 1 2 1 3 2 1 3 1 1
## [75] 1 3 1 2 2 1 3 2 3 1 1 1 1 1 1 2 1 2 2 2 1 1 1 3 3 3 1 3 1 2 2 1 1 1 1 3 2
## [112] 1 1 1 1 1 3 1 3 1 1 2 1 1 2 3 2 1 2 1 2 1 2 2 1 2 1 2 2 3 1 2 2 1 3 2 1 1
## [149] 3 3 1 1 2 2 1 1 3 2 2 3 3 1 3 1 2 1 1 2 1 1 1 2 1 1 1 3 1 2 3 2 2 1 1 2 3
## [186] 3 3 1 3 1 1 3 1 2 1 2 2 3 2 2 1 2 1 2 3 2 2 2 1 1 1 2 3 3 2 3 2 3 1 3 2 1
## [223] 1 1 3 2 1 1 2 1 3 2 1 1 2 3 3 3 2 1 1 2 1 2 2 1 1 3 3 1 2 2 3 3 1 2 1 2 3
## [260] 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 3 3 1 1 3 3 2 3 2 2 3 2 1 1 1 1 2 1 1 3 2 2
## [297] 2 2 3 1 3 1 1 3 3 3 2 1 1 2 1 2 1 1 3 3 1 1 2 1 2 1 3 2 2 3 3 2 1 3 3 1 1
## [334] 1 1 3 3 3 3 1 1 1 3 1 3 2 1 2 1 1 1 1 3 1 2 1 2 1 1 1 2 2 3 1 2 1 1 1 3 1
## [371] 1 1 3 1 2 1 1 1 2 1 1 3 3 1 2 3 1 1 1 3 3 1 1 2 1 2 1 2 1 1 3 2 3 2 3 1 2
## [408] 2 2 1 1 1 1 1 1 3 1 1 2 1 1 3 3 1 2 1 3 3 1 1 3 1 3 3 2 3 1 2 3 1 2 1 3 2
## [445] 2 3 3 1 2 2 2 3 1 2 2 3 1 1 2 1 3 1 1 2 3 1 3 2 1 1 2 1 3 2 2 3 1 2 1 2 3
## [482] 1 1 3 3 2 3 2 3 1 1 3 1 1 3 1 1 3 1 3 2 2 1 1 1 3 1 1 1 1 3 2 3 3 3 3 1 3
## [519] 1 2 1 3 1 1 1 2 2 1 3 1 2 3 1 2 3 2 2 1 1 2 1 2 2 2 2 2 2 2 2 1 3 3 2 3 2
## [556] 1 1 3 2 1 1 1 1 2 2 3 1 2 1 3 3 3 1 1 1 1 2 1 2 3 1 1 2 1 1 1 2 1 1 1 1 3
## [593] 2 3 2 3 3 1 1 1 3 1 3 2 2 2 2 1 3 1 1 1 2 3 1 1 3 2 2 1 3 2 1 1 3 1 1 1 3
## [630] 2 3 1 3 3 3 2 1 3 1 3 1 1 1 3 1 2 1 1 2 1 1 2 1 1 3 3 1 3 1 1 1 2 3 3 1 3
## [667] 1 1 1 3 2 1 1 1 2 1 2 1 3 3 1 1 1 1 1 2 1 2 1 1 1 2 2 3 1 2 3 2 2 1 1 1 1
## [704] 3 3 1 2 1 1 3 1 1 1 2 1 2 3 2 3 1 2 3 1 1 2 1 1 1 1 1 1 1 2 1 1 1 3 1 3 1
## [741] 3 1 2 1 3 2 1 2 1 3 2 2 1 1 3 1 2 2 3 2 1 1 2 1 2 3 2 1 1 3 1 3 3 1 3 3 2
## [778] 3 1 1 2 3 1 1 2 1 1 1 1 3 3 1 1 3 2 1 1 3 3 1 3 2 2 2 1 2 1 1 2 3 1 3 1 1
## [815] 2 1 1 2 1 1 1 3 2 1 2 1 1 3 2 1 2 3 1 1 1 3 3 1 2 1 2 1 3 3 1 1 3 2 1 2 2
## [852] 1 1 3 3 3 2 1 1 1 1 3 1 1 2 1 1 2 1 2 2 3 2 1 3 2 1 2 2 2 1 2 1 1 2 2 1 1
## [889] 1 1 2 3 2 3 3 1 1 3 1 2 3 1 1 1 2 1 2 1 3 1 2 1 1 2 1 1 3 2 2 2 1 2 1 1 2
## [926] 3 3 3 1 3 1 1 1 2 1 1 1 3 3 1 1 1 1 2 1 1 1 3 3 3 2 1 1 1 3 1 1 3 2 2 1 3
## [963] 1 1 3 2 2 1 3 1 3 1 1 1 1 3 3 1 1 1 1 1 3 1 2 1 1 2 2 3 1 2 1 2 1 1 3 2 1
## [1000] 3 1 2 1 1 1 2 3 1 2 3 1 1 1 3 3 3 1 1 1 1 2 1 2 3 1 3 1 1 3 1 1 1 1 1 1 3
## [1037] 3 2 3 1 2 1 3 1 1 3 1 1 1 3 2 1 1 3 2 1 1 2 3 1 2 1 1 1 3 1 3 1 3 2 2 1 2
## [1074] 1 1 3 2 2 1 2 3 1 1 3 1 1 1 1 3 2 1 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 3 3 2 3
## [1111] 2 1 1 2 1 1 3 2 2 3 3 3 1 2 1 1 3 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [1148] 1 1 3 1 3 1 2 1 3 1 1 2 1 2 3 1 2 1 2 3 3 1 3 1 1 2 1 2 2 1 3 3 3 1 2 1 1
## [1185] 1 1 1 3 2 1 2 3 2 1 1 1 1 1 1 3 2 2 2 2 1 1 1 3 1 2 1 2 1 1 2 1 1 1 3 3 2
## [1222] 2 1 2 2 1 2 2 3 1 3 1 1 1 2 2 1 1 1 2 2 1 2 2 1 1 1 2 2 2 1 2 1 1 3 3 2 2
## [1259] 3 3 1 3 3 1 3 1 2 3 3 2 2 3 3 1 3 2 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 1246.609 1065.266 1221.854
## (between_SS / total_SS = 50.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Graficamos los resultados obtenidos:
fviz_cluster(km, data = train_kmeans, geom = "point")
Vemos la media de los valores para cada uno de los diferentes clusters y su tamaño:
aggregate(train_kmeans, by=list(cluster=km$cluster), mean)
## cluster fixed_acidity volatile_acidity citric_acid density alcohol
## 1 1 7.798862 0.5699187 0.2045041 0.9969913 9.696016
## 2 2 7.110227 0.5246023 0.2134659 0.9948793 11.558286
## 3 3 10.864103 0.4420192 0.4761218 0.9984547 10.610791
## pH Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1 3.316846 0.8186424 -2.434321 2.701835
## 2 3.398239 0.8225008 -2.642422 2.513108
## 3 3.189583 0.9650535 -2.460537 2.270269
## Log_total_sulfur_dioxide Log_sulphates
## 1 3.861476 -0.4820557
## 2 3.374945 -0.4497846
## 3 3.353542 -0.3855143
km$size
## [1] 615 352 312
table(km$cluster, train$quality)
##
## 3 4 5 6 7 8
## 1 2 16 393 188 15 1
## 2 3 18 71 178 74 8
## 3 2 4 88 147 67 4
wholesaleBest = FitKMeans(train_kmeans, max.clusters = 10, nstart = 25, seed = 666)
wholesaleBest
## Clusters Hartigan AddCluster
## 1 2 783.50650 TRUE
## 2 3 332.10622 TRUE
## 3 4 198.08116 TRUE
## 4 5 178.29780 TRUE
## 5 6 153.88342 TRUE
## 6 7 88.19689 TRUE
## 7 8 85.66108 TRUE
## 8 9 87.23972 TRUE
## 9 10 74.03035 TRUE
PlotHartigan(wholesaleBest)
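La regla de Hartigan sugiere añadir un cluster más mientras el estadístico supere aproximadamente 10 (la columna AddCluster de FitKMeans ya codifica esa decisión). Un esbozo autocontenido del cálculo con datos simulados (los nombres datos, W y H son hipotéticos, solo ilustrativos):

```r
set.seed(1)
# Dos grupos bien separados en 2 dimensiones
datos <- rbind(matrix(rnorm(100), ncol = 2),
               matrix(rnorm(100, mean = 4), ncol = 2))
n <- nrow(datos)

# Suma de cuadrados intra-cluster para k = 1, 2, 3
W <- sapply(1:3, function(k) kmeans(datos, centers = k, nstart = 25)$tot.withinss)

# Estadístico de Hartigan: H(k) = (W(k)/W(k+1) - 1) * (n - k - 1)
H <- (W[1:2] / W[2:3] - 1) * (n - (1:2) - 1)
H  # H(1) >> 10: pasar de 1 a 2 clusters está claramente justificado
```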
Probamos ahora un modelo aditivo generalizado (GAM) con la variable respuesta continua “quality”:
gam_mod <- gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + s(citric_acid) +
s(fixed_acidity), data = train, method = "REML")
gam_mod
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Estimated degrees of freedom:
## 3.80 1.00 3.69 1.00 2.25 2.94 1.99
## 2.55 total = 20.23
##
## REML score: 1221.011
coef(gam_mod)
## (Intercept) s(alcohol).1
## 5.634871e+00 1.648020e-01
## s(alcohol).2 s(alcohol).3
## -1.165238e-01 7.041699e-02
## s(alcohol).4 s(alcohol).5
## -9.552495e-02 -4.469214e-02
## s(alcohol).6 s(alcohol).7
## 1.131493e-01 -4.810090e-02
## s(alcohol).8 s(alcohol).9
## 6.033374e-01 8.274343e-02
## s(volatile_acidity).1 s(volatile_acidity).2
## -4.570556e-05 1.038005e-06
## s(volatile_acidity).3 s(volatile_acidity).4
## -1.351067e-05 -1.793403e-05
## s(volatile_acidity).5 s(volatile_acidity).6
## 1.458357e-05 1.751913e-05
## s(volatile_acidity).7 s(volatile_acidity).8
## -1.271810e-05 1.597256e-04
## s(volatile_acidity).9 s(Log_sulphates).1
## -1.701127e-01 -7.829302e-03
## s(Log_sulphates).2 s(Log_sulphates).3
## -7.099365e-02 4.537419e-02
## s(Log_sulphates).4 s(Log_sulphates).5
## -3.268186e-02 -4.713404e-02
## s(Log_sulphates).6 s(Log_sulphates).7
## 1.907383e-02 1.687056e-02
## s(Log_sulphates).8 s(Log_sulphates).9
## 1.915352e-01 1.288852e-01
## s(Log_chlorides).1 s(Log_chlorides).2
## 8.907392e-06 -3.929104e-07
## s(Log_chlorides).3 s(Log_chlorides).4
## -4.775713e-06 -2.441926e-06
## s(Log_chlorides).5 s(Log_chlorides).6
## 2.867426e-06 5.221761e-07
## s(Log_chlorides).7 s(Log_chlorides).8
## -2.957613e-06 -1.322006e-05
## s(Log_chlorides).9 s(pH).1
## -6.951358e-02 -9.276906e-03
## s(pH).2 s(pH).3
## 1.043393e-02 -9.687645e-03
## s(pH).4 s(pH).5
## -1.755619e-02 2.316811e-03
## s(pH).6 s(pH).7
## 1.786883e-02 9.528146e-03
## s(pH).8 s(pH).9
## 1.380056e-01 -9.893089e-02
## s(Log_total_sulfur_dioxide).1 s(Log_total_sulfur_dioxide).2
## 1.264371e-01 -7.861424e-03
## s(Log_total_sulfur_dioxide).3 s(Log_total_sulfur_dioxide).4
## 3.959153e-02 -2.883118e-02
## s(Log_total_sulfur_dioxide).5 s(Log_total_sulfur_dioxide).6
## -3.644112e-02 3.414453e-02
## s(Log_total_sulfur_dioxide).7 s(Log_total_sulfur_dioxide).8
## 2.681938e-02 1.658594e-01
## s(Log_total_sulfur_dioxide).9 s(citric_acid).1
## 4.615124e-02 -8.294107e-03
## s(citric_acid).2 s(citric_acid).3
## 2.156461e-02 -8.422364e-03
## s(citric_acid).4 s(citric_acid).5
## 1.810441e-02 7.342591e-03
## s(citric_acid).6 s(citric_acid).7
## -2.160911e-02 7.256404e-03
## s(citric_acid).8 s(citric_acid).9
## -1.070232e-01 -3.440155e-02
## s(fixed_acidity).1 s(fixed_acidity).2
## -3.380473e-03 -4.880922e-02
## s(fixed_acidity).3 s(fixed_acidity).4
## -2.163905e-02 -4.395839e-02
## s(fixed_acidity).5 s(fixed_acidity).6
## 1.957378e-02 -4.981481e-02
## s(fixed_acidity).7 s(fixed_acidity).8
## -2.147517e-02 2.926902e-01
## s(fixed_acidity).9
## -3.132917e-02
summary(gam_mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63487 0.01713 328.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 3.799 4.762 54.064 < 2e-16 ***
## s(volatile_acidity) 1.003 1.006 54.789 < 2e-16 ***
## s(Log_sulphates) 3.691 4.634 23.398 < 2e-16 ***
## s(Log_chlorides) 1.000 1.001 11.922 0.000572 ***
## s(pH) 2.252 2.906 5.016 0.001882 **
## s(Log_total_sulfur_dioxide) 2.943 3.724 3.550 0.008809 **
## s(citric_acid) 1.991 2.508 2.440 0.062463 .
## s(fixed_acidity) 2.553 3.260 1.909 0.134935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.404 Deviance explained = 41.3%
## -REML = 1221 Scale est. = 0.37532 n = 1279
gam_mod2 = gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates),
data = train, method = "REML")
summary(gam_mod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63487 0.01749 322.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 3.958 4.951 60.74 <2e-16 ***
## s(volatile_acidity) 1.001 1.002 88.52 <2e-16 ***
## s(Log_sulphates) 4.191 5.202 19.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.379 Deviance explained = 38.3%
## -REML = 1231.4 Scale est. = 0.39111 n = 1279
plot(gam_mod2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue")
gam.check(gam_mod2)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.529408e-05,1.537895e-06]
## (score 1231.431 & scale 0.3911066).
## Hessian positive definite, eigenvalue range [2.519977e-05,637.5075].
## Model rank = 28 / 28
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(alcohol) 9.00 3.96 1.03 0.845
## s(volatile_acidity) 9.00 1.00 0.96 0.065 .
## s(Log_sulphates) 9.00 4.19 0.97 0.140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Referencia: https://noamross.github.io/gams-in-r-course/chapter4
Lo primero de todo, creamos la variable binaria “nota_vino” para que, en función de “quality”, nos indique los vinos con calificaciones aprobadas (quality >= 6, anotadas con un “1”) o suspensas (quality < 6, anotadas con un “0”).
train_gam <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_gam
## # A tibble: 1,279 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_gam$nota_vino)
##
## 0 1
## 597 682
Para aplicar un GAM logístico a nuestro problema, utilizamos el paquete mgcv con family = binomial, que indica a la función gam que nuestra variable respuesta será 0 o 1, es decir, vino malo o vino bueno. Las variables van envueltas en la función s(), que especifica que queremos una relación flexible (suavizada).
gam_mod_log = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
s(Log_sulphates) + s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
s(citric_acid) + s(fixed_acidity), data = train_gam, method = "REML",
family = binomial)
summary(gam_mod_log)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.30934 0.07799 3.966 7.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.846 6.974 143.805 < 2e-16 ***
## s(volatile_acidity) 1.855 2.377 27.641 3.25e-06 ***
## s(Log_sulphates) 2.484 3.169 61.194 < 2e-16 ***
## s(Log_chlorides) 6.015 7.183 18.378 0.0121 *
## s(pH) 2.416 3.119 6.121 0.1115
## s(Log_total_sulfur_dioxide) 3.741 4.706 28.623 2.87e-05 ***
## s(citric_acid) 1.490 1.828 6.633 0.0191 *
## s(fixed_acidity) 2.535 3.226 6.556 0.1200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.371 Deviance explained = 32.2%
## -REML = 642.13 Scale est. = 1 n = 1279
Nos quedamos solo con las variables más significativas (tres ***).
gam_mod_log2 = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = train_gam,
method = "REML", family = binomial)
summary(gam_mod_log2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_total_sulfur_dioxide)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26735 0.07471 3.579 0.000345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.927 7.058 150.80 < 2e-16 ***
## s(volatile_acidity) 1.049 1.095 34.16 < 2e-16 ***
## s(Log_sulphates) 4.078 5.044 54.78 < 2e-16 ***
## s(Log_total_sulfur_dioxide) 3.557 4.487 35.85 5.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.346 Deviance explained = 29.4%
## -REML = 649.89 Scale est. = 1 n = 1279
Our intercept is 0.26735, and we use the plogis() function (the inverse logit) to transform it into a probability.
plogis(coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
This value means that our model predicts a baseline probability of approximately 57% that a wine is good.
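As a quick check of what plogis() does: it is base R's logistic distribution function, the inverse of the logit link, so it converts values on the log-odds scale into probabilities in (0, 1).

```r
# plogis() is base R's logistic CDF, i.e. the inverse of the logit:
# plogis(x) = 1 / (1 + exp(-x)), mapping log-odds into (0, 1)
plogis(0)                # log-odds of 0 correspond to probability 0.5
plogis(0.26735)          # our intercept on the probability scale (~0.57)
qlogis(plogis(0.26735))  # qlogis() is the logit itself, undoing plogis()
```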
plot(gam_mod_log2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue", trans = plogis, shift = coef(gam_mod_log2)[1],
seWithMean = TRUE, col = "purple")
#predict(gam_mod_log2, type="response", se.fit = TRUE)
#plogis(predict(gam_mod_log2, type="link"))
We now evaluate the model on the test set. We build test_gam with the same binary response, nota_vino, and then refit the selected specification on it to check whether the same terms remain significant.
test_gam <- test %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
test_gam
## # A tibble: 320 x 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # ... with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_gam$nota_vino)
##
## 0 1
## 147 173
Explaining the predictors: predict(type = "terms") returns each smooth term's contribution, on the logit (log-odds) scale, for every observation.
head(predict(gam_mod_log2, type = "terms"))
## s(alcohol) s(volatile_acidity) s(Log_sulphates) s(Log_total_sulfur_dioxide)
## 1 -0.153936041 0.11651423 -0.54774301 0.15593321
## 2 -0.891143170 0.09074145 -0.24318835 -0.49397254
## 3 0.007793508 -1.25421557 -0.01908907 -0.49397254
## 4 -1.063371315 0.24564085 0.09246652 -0.13861017
## 5 -0.910318840 -0.16603197 0.25861731 0.09886004
## 6 -0.792699732 -0.73698784 -0.68971449 0.37393429
predict(gam_mod_log2, type = "terms")[1, ]
## s(alcohol) s(volatile_acidity)
## -0.1539360 0.1165142
## s(Log_sulphates) s(Log_total_sulfur_dioxide)
## -0.5477430 0.1559332
plogis(sum(predict(gam_mod_log2, type = "terms")) + coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
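Note that summing predict(..., type = "terms") over all rows reproduces the intercept probability: mgcv centres each smooth term so that its contributions sum to zero over the training data. An individual wine's probability comes from summing only its own row and adding the intercept. A minimal self-contained sketch on toy data (the data and model here are illustrative stand-ins, not the wine model):

```r
library(mgcv)

# Toy binary data standing in for train_gam
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- rbinom(200, 1, plogis(2 * sin(2 * pi * d$x)))
m <- gam(y ~ s(x), data = d, family = binomial, method = "REML")

# Each smooth is centred, so its per-observation contributions sum to ~0;
# this is why plogis(sum(all terms) + intercept) equals the intercept probability
sum(predict(m, type = "terms")[, "s(x)"])  # ~ 0

# Probability for observation 1: that row's term contributions plus the
# intercept, pushed through the inverse logit
p1 <- plogis(sum(predict(m, type = "terms")[1, ]) + coef(m)[1])
p1_check <- predict(m, type = "response")[1]  # same value computed directly
```

Applying the same row-wise sum to gam_mod_log2 gives each wine's predicted probability of being good.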
gam_mod_log2_test = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = test_gam,
method = "REML", family = binomial)
summary(gam_mod_log2_test)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_total_sulfur_dioxide)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2339 0.1332 1.757 0.0789 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 1.000 1.000 25.605 7.28e-07 ***
## s(volatile_acidity) 1.000 1.000 18.279 1.93e-05 ***
## s(Log_sulphates) 2.309 2.948 7.506 0.0627 .
## s(Log_total_sulfur_dioxide) 2.595 3.265 10.146 0.0231 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.257 Deviance explained = 22.3%
## -REML = 180.54 Scale est. = 1 n = 320
plogis(coef(gam_mod_log2_test)[1])
## (Intercept)
## 0.5582201
plot(gam_mod_log2_test, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue", trans = plogis, shift = coef(gam_mod_log2_test)[1],
seWithMean = TRUE, col = "purple")
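Refitting the model on the test split checks whether the same terms stay significant, but the usual out-of-sample evaluation is to predict the test wines with the model fitted on the training set. A minimal sketch, assuming gam_mod_log2 and test_gam from above; the 0.5 threshold and the pROC-based AUC are our choices, not part of the original analysis:

```r
library(pROC)

# Predicted probabilities for the test wines from the training-set model
p_test <- predict(gam_mod_log2, newdata = test_gam, type = "response")

# Hard classification at a 0.5 threshold and the resulting confusion matrix
pred_class <- ifelse(p_test > 0.5, 1, 0)
conf_mat <- table(predicted = pred_class, observed = test_gam$nota_vino)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Threshold-free performance: area under the ROC curve
auc_test <- auc(roc(test_gam$nota_vino, as.numeric(p_test)))
```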
Multiple Linear Regression: The fitted multiple linear regression model shows a lack of adequacy and fit. Its residuals exhibit heteroscedasticity (non-constant variance, violating the homoscedasticity assumption), and the model does not adequately predict the response (dependent) variable from the explanatory (independent) variables. Since our dependent variable, "quality", is discrete, an ordinary linear regression model does not fit our data.
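The heteroscedasticity described above can also be tested formally with the Breusch-Pagan test from the lmtest package (already loaded); a minimal sketch assuming the red_wine data frame from the start of the report, where the model object modelo_rlm is a hypothetical name for the refitted regression:

```r
library(lmtest)

# Hypothetical refit of the multiple linear regression on all predictors
modelo_rlm <- lm(quality ~ ., data = red_wine)

# Breusch-Pagan test: H0 = homoscedastic residuals; a small p-value
# supports the heteroscedasticity seen in the residual plots
bptest(modelo_rlm)
```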
Dimensionality Reduction (PCA and t-SNE):
Logistic Regression:
KNN:
Decision Tree:
Ensemble Methods:
Random Forest:
Hyperparameter Tuning:
Clustering (K-means, Hierarchical, and HDBSCAN):
GAM: